
Dependence: Theory and Practice
Allen and Kennedy, Chapter 2
Liza Fireman

Optimizations
• On scalar machines, the principal optimizations are register allocation, instruction scheduling, and reducing the cost of array address calculations.
• On parallel machines, the optimization process also includes finding parallelism in sequential code.
• Parallel tasks perform similar operations on different elements of the data arrays.

Fortran Loops

DO I=1,N
  A(I)=B(I)+C
ENDDO

A(1:N)=B(1:N)+C

Fortran Loops

DO I=1,N
  A(I+1)=A(I)+B(I)
ENDDO

A(2:N+1)=A(1:N)+B(1:N)

Bernstein’s conditions
When is it safe to run two tasks R1 and R2 in parallel? If none of the following holds:
1. R1 writes into a memory location that R2 reads
2. R2 writes into a memory location that R1 reads
3. Both R1 and R2 write to the same memory location
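As a quick illustration, here is a minimal Python sketch of this test (the function name and read/write-set representation are my own, not from the book):

def can_run_in_parallel(reads1, writes1, reads2, writes2):
    # Bernstein's conditions: safe iff no write of one task overlaps
    # the reads or writes of the other.
    return not (writes1 & reads2       # condition 1: R1 writes what R2 reads
                or writes2 & reads1    # condition 2: R2 writes what R1 reads
                or writes1 & writes2)  # condition 3: both write the same location

# R1: A = B + C and R2: D = A + E conflict on A; with F in place of A they do not.
print(can_run_in_parallel({"B", "C"}, {"A"}, {"A", "E"}, {"D"}))  # False
print(can_run_in_parallel({"B", "C"}, {"A"}, {"F", "E"}, {"D"}))  # True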

Assumption
The compiler must be able to determine whether two different iterations access the same memory location.

Data dependence

S1  PI = 3.14
S2  R = 5.0
S3  AREA = PI * R ** 2

• There are data dependences from S1 and S2 to S3.
• There is no execution constraint between S1 and S2, so both execution orders (S1, S2, S3) and (S2, S1, S3) are valid.

Control dependence

S1  IF (T != 0.0) GOTO S3
S2  A = A / T
S3  CONTINUE

S2 cannot be executed before S1.

Data dependence  We can not interchange loads and stores to the same location.  There is a Data dependence between S 1 and S 2 iff (1) S 1 and S 2 access the same memory location and at least one of the stores into it (2) there is a feasible run-time execution path from S 1 to S 2.

• True dependence (RAW), denoted S1 δ S2:
S1  X = …
S2  … = X

• Antidependence (WAR), denoted S1 δ⁻¹ S2:
S1  … = X
S2  X = …

• Output dependence (WAW), denoted S1 δ° S2:
S1  X = …
S2  X = …

Example:
S1  X = 3
    …
S3  X = 5
S4  W = X * Y

The output dependence S1 δ° S3 must be preserved so that S4 uses the value stored by S3.

DO I=1,N
S   A(I+1)=A(I)+B(I)
ENDDO

DO I=1,N
S   A(I+2)=A(I)+B(I)
ENDDO

In both loops, S has a true dependence on itself (S δ S); in the first loop the dependence distance is 1, in the second it is 2.

Iteration number

DO I=L,U,S
  …
ENDDO

• When I=L, the iteration number is 1.
• When I=L+S, the iteration number is 2.
• When I=L+J*S, the iteration number is J+1.
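Equivalently, the iteration number of index value I is (I-L)/S + 1. A one-line Python helper (my own, for illustration):

def iteration_number(i, lower, step):
    # Iteration number of index value i in DO I=L,U,S
    return (i - lower) // step + 1

print(iteration_number(1, 1, 1))   # I=L           -> 1
print(iteration_number(7, 1, 2))   # I=L+3*S (J=3) -> 4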

Nesting level
In a loop nest, the nesting level of a specific loop is one more than the number of loops that enclose it; the outermost loop has level 1.

DO I=1,N
  DO J=1,M
    DO K=1,L
S     A(I+1,J,K)=A(I,J,K)+10
    ENDDO
  ENDDO
ENDDO

Here the I, J, and K loops are at nesting levels 1, 2, and 3 respectively.

Iteration Vector
Given a nest of n loops, the iteration vector of a particular iteration is a vector i = (i1, i2, …, in) where ik, 1 ≤ k ≤ n, is the iteration number for the loop at nesting level k.

DO I=1,2
  DO J=1,3
S     …
  ENDDO
ENDDO

The iteration vectors, in execution order, are {(1,1), (1,2), (1,3), (2,1), (2,2), (2,3)}; S[(2,1)] denotes the instance of S executed on iteration I=2, J=1.

Iteration Vector – Lexicographic order
Iteration i precedes iteration j, denoted i < j, iff
(1) i[1:n-1] < j[1:n-1], or
(2) i[1:n-1] = j[1:n-1] and i_n < j_n.
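In code, this recursive definition reduces to a left-to-right scan. A small Python sketch (the helper is my own; Python's built-in tuple comparison i < j implements the same order):

def precedes(i, j):
    # True iff iteration vector i lexicographically precedes j.
    for a, b in zip(i, j):
        if a != b:
            return a < b
    return False  # identical vectors: neither precedes the other

print(precedes((1, 3), (2, 1)))  # True: they differ first in the outer loop
print(precedes((2, 1), (2, 1)))  # False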

Loop dependence
There exists a dependence from statement S1 to statement S2 in a common nest of loops iff there exist two iteration vectors i and j such that:
(1) i < j, or i = j and there is a path from S1 to S2 in the body of the loop;
(2) S1 accesses memory location M on iteration i and S2 accesses M on iteration j;
(3) one of these accesses is a write.

Transformations
• A transformation is “safe” if the transformed program has the same “meaning”: it must have no effect on the outputs.
• A reordering transformation is any program transformation that merely changes the order of execution, without adding or deleting any executions of any statements.

Transformations – cont.
• A reordering transformation does not add or eliminate dependences.
• It can, however, change the relative order of the endpoints of a dependence, and reversing a dependence can lead to incorrect behavior.
• A reordering transformation preserves a dependence if it preserves the relative execution order of the source and sink of that dependence.

Fundamental Theorem of Dependence
Any reordering transformation that preserves every dependence in a program preserves the meaning of that program.
• A transformation is said to be valid for the program to which it applies if it preserves all dependences in the program.

Example

L0  DO I=1,N
L1    DO J=1,2
S0      A(I,J)=A(I,J)+B
      ENDDO
S1    T=A(I,1)
S2    A(I,1)=A(I,2)
S3    A(I,2)=T
    ENDDO

Distance Vector
• Consider a dependence in a loop nest of n loops:
  o Statement S1 on iteration i is the source of the dependence.
  o Statement S2 on iteration j is the sink of the dependence.
• The distance vector d(i,j) is a vector of length n such that d(i,j)_k = j_k - i_k.
• Note that the iteration vectors are normalized.

Example

DO J=1,10
  DO I=1,99
S1    A(I,J)=B(I,J)+X
S2    C(I,J)=A(100-I,J)+Y
  ENDDO
ENDDO

The element A(I,J) written by S1 on iteration (J,I) is read by S2 on iteration (J,100-I), giving distance vectors of the form (0, 100-2I). There are a total of 50 different distances (0, 2, …, 98) for true dependences from S1 to S2, and 49 distances (2, 4, …, 98) for antidependences from S2 to S1 (the iterations where the read precedes the write).
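A throwaway Python check of these counts (my own script, not from the book):

# S1 writes A(I,J) on iteration I; S2 reads the same element A(100-I',J)
# on iteration I' = 100-I. The J distance is always 0.
true_dists, anti_dists = set(), set()
for i_write in range(1, 100):
    i_read = 100 - i_write
    if 1 <= i_read <= 99:
        d = i_read - i_write
        if d >= 0:
            true_dists.add(d)    # write on an earlier (or the same) iteration
        else:
            anti_dists.add(-d)   # read comes first: antidependence from S2 to S1
print(len(true_dists), len(anti_dists))  # 50 49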

Direction Vector
Suppose that there is a dependence from statement S1 on iteration i of a loop nest of n loops to statement S2 on iteration j. Then the dependence direction vector D(i,j) is defined as a vector of length n such that

D(i,j)_k = “<” if d(i,j)_k > 0
           “=” if d(i,j)_k = 0
           “>” if d(i,j)_k < 0

Distance and Direction Vectors - Example

DO I=1,N
  DO J=1,M
    DO K=1,L
S     A(I+1,J,K-1)=A(I,J,K)+10
    ENDDO
  ENDDO
ENDDO

S has a true dependence on itself (S δ S).
Distance Vector: (1, 0, -1)
Direction Vector: (<, =, >)
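Both vectors are mechanical to compute once the source and sink iteration vectors are known. A minimal Python sketch (the helper names are my own):

def distance_vector(i, j):
    # d(i,j)_k = j_k - i_k, from source iteration i to sink iteration j
    return tuple(jk - ik for ik, jk in zip(i, j))

def direction_vector(i, j):
    # "<" if d_k > 0, "=" if d_k = 0, ">" if d_k < 0
    return tuple("<" if d > 0 else "=" if d == 0 else ">"
                 for d in distance_vector(i, j))

# S writes A(2,1,1) on iteration (1,1,2); S reads A(2,1,1) on iteration (2,1,1).
print(distance_vector((1, 1, 2), (2, 1, 1)))   # (1, 0, -1)
print(direction_vector((1, 1, 2), (2, 1, 1)))  # ('<', '=', '>')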

Distance and Direction Vectors – cont.
• A dependence cannot exist if it has a direction vector whose leftmost non-“=” component is not “<”, as this would imply that the sink of the dependence occurs before the source.

Direction Vector Transformation
• Let T be a transformation that is applied to a loop nest and that does not rearrange the statements in the body of the loop. Then the transformation is valid if, after it is applied, none of the direction vectors for dependences with source and sink in the nest has a leftmost non-“=” component that is “>”.
• This follows from the Fundamental Theorem of Dependence:
  o All dependences still exist.
  o None of the dependences have been reversed.

Loop-carried Dependence
Statement S2 has a loop-carried dependence on statement S1 if and only if S1 references location M on iteration i, S2 references M on iteration j, and d(i,j) > 0 (that is, D(i,j) contains a “<” as its leftmost non-“=” component). In other words, S1 and S2 execute on different iterations.

DO I=1,N
S1  A(I+1)=F(I)
S2  F(I+1)=A(I)
ENDDO

Level of Loop-carried Dependence
• The level of a loop-carried dependence is the index of the leftmost non-“=” component of D(i,j) for the dependence.
• A level-k dependence between S1 and S2 is denoted by S1 δk S2.

DO I=1,N
  DO J=1,M
    DO K=1,L
S     A(I,J,K+1)=A(I,J,K)+10
    ENDDO
  ENDDO
ENDDO

The direction vector for the dependence of S on itself is (=, =, <), so the level of the dependence is 3: S δ3 S.

Loop-carried Transformations
• Any reordering transformation that does not alter the relative order of any loops in the nest and preserves the iteration order of the level-k loop preserves all level-k dependences.
• Proof:
  – D(i,j) has a “<” in the k-th position and “=” in positions 1 through k-1.
  – The source and sink of the dependence are therefore in the same iteration of loops 1 through k-1.
  – A reordering of iterations of those loops cannot change the sense of the dependence.

Loop-carried Transformations – cont.

DO I=1,N
S1  A(I+1)=F(I)
S2  F(I+1)=A(I)
ENDDO

DO I=1,N
S1  F(I+1)=A(I)
S2  A(I+1)=F(I)
ENDDO

Both dependences are carried at level 1, so interchanging the two statements within the loop body preserves them: the transformation is valid.

DO I=1,N
  DO J=1,M
    DO K=1,L
S     A(I,J,K+1)=A(I,J,K)+10
    ENDDO
  ENDDO
ENDDO

DO I=1,N
  DO J=M,1,-1
    DO K=1,L
S     A(I,J,K+1)=A(I,J,K)+10
    ENDDO
  ENDDO
ENDDO

The dependence is carried at level 3, so reversing the level-2 J loop preserves it: the transformation is valid.

Loop-independent Dependence
Statement S2 has a loop-independent dependence on statement S1 if and only if there exist two iteration vectors i and j such that:
(1) S1 refers to memory location M on iteration i, S2 refers to M on iteration j, and i = j;
(2) there is a control flow path from S1 to S2 within the iteration.
In other words, S1 and S2 execute on the same iteration.

Loop-independent Dependence - Example
• A loop-independent dependence of S2 on S1 is denoted by S1 δ∞ S2.
• Note that the direction vector will have entries that are all “=” for loop-independent dependences.

DO I=1,N
S1  A(I)=...
S2  ...=A(I)
ENDDO

Loop-independent Dependence - Example

DO I=1,9
S1  A(I)=...
S2  ...=A(10-I)
ENDDO

S1 writes A(I) and S2 reads A(10-I). For a loop-independent dependence the two accesses must occur on the same iteration, which happens only when I = 10-I, that is, on iteration I=5.

Loop-independent Dependence - Example

DO I=1,10
S1  A(I)=...
ENDDO
DO I=1,10
S2  ...=A(20-I)
ENDDO

S1 stores A(10) on iteration 10 of the first loop, and S2 reads A(20-I), which touches A(10) on iteration 10 of the second loop. The dependence is not carried by any common loop, so it is loop-independent even though the statements are in different loops.

Loop-independent Dependence Transformation
• If there is a loop-independent dependence from S1 to S2, any reordering transformation that does not move statement instances between iterations and preserves the relative order of S1 and S2 in the loop body preserves that dependence.
• This follows from the Fundamental Theorem of Dependence:
  o All dependences still exist.
  o None of the dependences have been reversed.

Simple Dependence Testing

DO I1=L1,U1,S1
  DO I2=L2,U2,S2
    ...
    DO In=Ln,Un,Sn
S1      A(f1(i1,…,in),…,fm(i1,…,in))=...
S2      ...=A(g1(i1,…,in),…,gm(i1,…,in))
    ENDDO
    ...
  ENDDO
ENDDO

Simple Dependence Testing – cont.
A dependence exists from S1 to S2 if and only if there exist iteration vectors a and b such that:
(1) a is lexicographically less than or equal to b, and
(2) the following system of dependence equations is satisfied:
    f_i(a) = g_i(b) for all i, 1 ≤ i ≤ m
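This condition suggests a brute-force tester for small loop nests: enumerate pairs of iteration vectors and check the dependence equations directly. A minimal Python sketch under that reading (function names and the subscript-as-lambda representation are my own; practical compilers use symbolic tests rather than enumeration):

from itertools import product

def find_dependences(bounds, f, g):
    # bounds: list of (L, U) pairs, one per loop; step 1 assumed.
    # f, g: subscript functions of the full iteration vector for the
    # write A(f(...)) in S1 and the read A(g(...)) in S2.
    space = [range(l, u + 1) for l, u in bounds]
    deps = []
    for a, b in product(product(*space), repeat=2):
        if a <= b and f(*a) == g(*b):  # a precedes (or equals) b, same element
            deps.append((a, b))
    return deps

# DO I=1,4:  S1 A(I+1)=...   S2 ...=A(I)   ->  f(I)=I+1, g(I)=I
print(find_dependences([(1, 4)], lambda i: i + 1, lambda i: i))
# [((1,), (2,)), ((2,), (3,)), ((3,), (4,))]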

Simple Dependence Testing: Delta Notation

DO I=1,N
S   A(I+1)=A(I)+B
ENDDO

• The notation represents index values at the source and sink of a dependence.
• Iteration at the source is denoted by: I0
• Iteration at the sink is denoted by: I0 + ΔI
• Equating the subscripts of A gives us: I0 + 1 = I0 + ΔI
• Solving this gives us: ΔI = 1
• Carried dependence with distance vector (1) and direction vector (<)

Simple Dependence Testing: Delta Notation

DO I=1,100
  DO J=1,100
    DO K=1,100
S     A(I+1,J,K)=A(I,J,K+1)+10
    ENDDO
  ENDDO
ENDDO

I0 + 1 = I0 + ΔI;  J0 = J0 + ΔJ;  K0 = K0 + ΔK + 1
ΔI = 1; ΔJ = 0; ΔK = -1
Corresponding direction vector: (<, =, >)

Simple Dependence Testing: Delta Notation - cont.
If a loop index does not appear in the dependence equations, its distance is unconstrained and its direction is “*”.

DO I=1,100
  DO J=1,100
S   A(I+1)=A(I)+B(J)
  ENDDO
ENDDO

Corresponding direction vector: (<, *)

Simple Dependence Testing: Delta Notation – cont.

DO J=1,100
  DO I=1,100
S   A(I+1)=A(I)+B(J)
  ENDDO
ENDDO

Corresponding direction vector: (*, <), which stands for the set {(<, <), (=, <), (>, <)}. The entry (>, <) denotes a level-1 antidependence with direction vector (<, >).

Parallelization and Vectorization
It is valid to convert a sequential loop to a parallel loop if the loop carries no dependence.

DO I=1,N
  X(I)=X(I)+C
ENDDO

X(1:N)=X(1:N)+C ✔

Parallelization and Vectorization

DO I=1,N
  X(I+1)=X(I)+C
ENDDO

X(2:N+1)=X(1:N)+C ✘

The loop carries a dependence (each iteration reads the value stored by the previous one), so the vector statement is not equivalent.

Loop Distribution
Can statements in loops which carry dependences be vectorized?

DO I=1,N
S1  A(I+1)=B(I)+C
S2  D(I)=A(I)+E
ENDDO

S1 δ1 S2, and distributing the loop preserves this dependence:

DO I=1,N
S1  A(I+1)=B(I)+C
ENDDO
DO I=1,N
S2  D(I)=A(I)+E
ENDDO

A(2:N+1)=B(1:N)+C
D(1:N)=A(1:N)+E

Loop Distribution – cont.
Loop distribution fails if there is a cycle of dependences:

DO I=1,N
S1  A(I+1)=B(I)+C
S2  B(I+1)=A(I)+E
ENDDO

S1 δ1 S2 and S2 δ1 S1

Example

DO I=1,100
S1  X(I)=Y(I)+10
    DO J=1,100
S2    B(J)=A(J,N)
      DO K=1,100
S3      A(J+1,K)=B(J)+C(J,K)
      ENDDO
S4    Y(I+J)=A(J+1,N)
    ENDDO
ENDDO


Simple Vectorization Algorithm

procedure vectorize(L, D)
  // L is the maximal loop nest
  // D is the dependence graph for statements in L
  find the set {S1, S2, ..., Sm} of maximal strongly-connected components in D;
  reduce each Si to a single node and compute the naturally induced dependence graph;
  order the reduced graph with a topological sort;
  for i = 1 to m do begin
    if Si is a dependence cycle then
      generate a DO-loop around the statements in Si;
    else
      directly rewrite Si, vectorized with respect to every loop containing it;
  end
end vectorize
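As an executable illustration, here is a minimal Python sketch of the same control flow (the statement names, edge-list representation, and printed output are my own; Tarjan's algorithm conveniently yields the strongly-connected components in reverse topological order):

def tarjan_scc(nodes, edges):
    # Tarjan's algorithm; returns SCCs in reverse topological order.
    adj = {n: [] for n in nodes}
    for u, v in edges:
        adj[u].append(v)
    index, low, on_stack, stack = {}, {}, set(), []
    sccs, counter = [], [0]

    def strongconnect(v):
        index[v] = low[v] = counter[0]; counter[0] += 1
        stack.append(v); on_stack.add(v)
        for w in adj[v]:
            if w not in index:
                strongconnect(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v is the root of an SCC
            comp = []
            while True:
                w = stack.pop(); on_stack.discard(w); comp.append(w)
                if w == v:
                    break
            sccs.append(comp)

    for v in nodes:
        if v not in index:
            strongconnect(v)
    return sccs

def vectorize(statements, dep_edges):
    # Process components in topological order of the condensation.
    for scc in reversed(tarjan_scc(statements, dep_edges)):
        if len(scc) > 1 or (scc[0], scc[0]) in dep_edges:
            print("DO-loop around:", ", ".join(sorted(scc)))
        else:
            print("vector statement:", scc[0])

# S1 A(I+1)=B(I)+C ; S2 B(I+1)=A(I)+E ; S3 D(I)=A(I)+E
vectorize(["S1", "S2", "S3"],
          [("S1", "S2"), ("S2", "S1"), ("S1", "S3")])
# DO-loop around: S1, S2
# vector statement: S3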

Simple Vectorization Algorithm – cont.

DO I=1,100
S1  X(I)=Y(I)+10
    DO J=1,100
S2    B(J)=A(J,N)
      DO K=1,100
S3      A(J,K+1)=B(J)+C(J,K)
      ENDDO
S4    Y(I+J)=A(J+1,N)
    ENDDO
ENDDO

[Figure: dependence graph among S1-S4, with loop-carried (δ1, δ2) and loop-independent (δ∞) edges; S2, S3, and S4 form a dependence cycle, and S4 has a level-1 true dependence to S1.]

Simple Vectorization Algorithm – cont.

DO I=1,100
  DO J=1,100
S2    B(J)=A(J,N)
      DO K=1,100
S3      A(J,K+1)=B(J)+C(J,K)
      ENDDO
S4    Y(I+J)=A(J+1,N)
  ENDDO
ENDDO
X(1:100)=Y(1:100)+10

[Figure: the same dependence graph; only S1, which lies outside every cycle, is vectorized.]

Problems With Simple Vectorization

DO I=1,N
  DO J=1,M
S   A(I+1,J)=A(I,J)+B
  ENDDO
ENDDO

The dependence from S to itself has d(i,j) = (1, 0): it is carried by the I loop (S δ1 S), so the inner J loop could run in parallel:

DO I=1,N
S A(I+1,1:M)=A(I,1:M)+B
ENDDO

The simple algorithm does not capitalize on such opportunities: because S lies in a dependence cycle, the whole nest is left sequential.

Advanced Vectorization Algorithm

procedure codegen(R, k, D)
  // R is the region for which we must generate code.
  // k is the minimum nesting level of possible parallel loops.
  // D is the dependence graph among statements in R.
  find the set {S1, S2, ..., Sm} of maximal strongly-connected components in D restricted to R;
  reduce each Si to a single node and compute the induced dependence graph;
  order the reduced graph with a topological sort;
  for i = 1 to m do begin
    if Si is cyclic then begin
      generate a level-k DO statement;
      let Di be the dependence graph of Si with all dependence edges at level k eliminated;
      codegen(Si, k+1, Di);
      generate the level-k ENDDO statement;
    end
    else
      generate a vector statement for Si in r(Si)-k+1 dimensions, where r(Si) is the number of loops containing Si;
  end
end codegen
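The following Python sketch mirrors this recursion. The data layout is my own invention, not the book's: dependences are (source, sink, level) triples with level None for loop-independent edges, loops[s] gives the number of loops containing s, and the dependence levels in the demo are illustrative assumptions rather than computed facts:

def sccs(nodes, edges):
    # Brute-force SCCs for small graphs: u and v share a component
    # iff each reaches the other.
    def reach(u):
        seen, work = {u}, [u]
        while work:
            x = work.pop()
            for a, b in edges:
                if a == x and b not in seen:
                    seen.add(b); work.append(b)
        return seen
    r = {u: reach(u) for u in nodes}
    comps, done = [], set()
    for u in nodes:
        if u not in done:
            comp = [v for v in nodes if v in r[u] and u in r[v]]
            comps.append(comp); done.update(comp)
    return comps

def topo_order(comps, edges):
    # Repeatedly pick a component with no incoming edge from the rest.
    remaining, ordered = list(comps), []
    while remaining:
        for c in remaining:
            rest = {u for d in remaining if d is not c for u in d}
            if not any(u in rest and v in c for u, v in edges):
                ordered.append(c); remaining.remove(c)
                break
    return ordered

def codegen(region, k, deps, loops):
    edges = [(u, v) for u, v, lvl in deps if u in region and v in region]
    for comp in topo_order(sccs(sorted(region), edges), edges):
        indent = "  " * (k - 1)
        if len(comp) > 1 or (comp[0], comp[0]) in edges:   # dependence cycle
            print(indent + "DO  ! level-%d loop" % k)
            codegen(set(comp), k + 1, [d for d in deps if d[2] != k], loops)
            print(indent + "ENDDO")
        else:
            s = comp[0]   # r(s)-k+1 dimensions; 0 dimensions means a scalar statement
            print(indent + "%s as a vector statement in %d dimension(s)" % (s, loops[s] - k + 1))

# {S2,S3,S4} inside the I loop (k=2), with assumed levels:
# S2 -> S3 loop-independent, S3 -> S2 carried at level 2, S3 -> S4 loop-independent.
deps = [("S2", "S3", None), ("S3", "S2", 2), ("S3", "S4", None)]
codegen({"S2", "S3", "S4"}, 2, deps, {"S2": 2, "S3": 3, "S4": 2})
# Output: a level-2 DO around S2 (0 dimensions, i.e. scalar) and S3 (1 dimension),
# then S4 as a vector statement in 1 dimension.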

Advanced Vectorization Algorithm – cont.

DO I=1,100
S1  X(I)=Y(I)+10
    DO J=1,100
S2    B(J)=A(J,N)
      DO K=1,100
S3      A(J,K+1)=B(J)+C(J,K)
      ENDDO
S4    Y(I+J)=A(J+1,N)
    ENDDO
ENDDO

Advanced Vectorization Algorithm – cont.

DO I=1,100
S1  X(I)=Y(I)+10
    DO J=1,100
S2    B(J)=A(J,N)
      DO K=1,100
S3      A(J,K+1)=A(J+1,N)+10
      ENDDO
S4    Y(I+J)=A(J+1,N)
    ENDDO
ENDDO

[Figure: dependence graph among S1-S4 with δ1, δ2, and δ∞ edges.]

Advanced Vectorization Algorithm – cont.

DO I=1,100
  codegen({S2,S3,S4}, 2)
ENDDO
X(1:100)=Y(1:100)+10

[Figure: the full dependence graph; S1 lies outside every cycle and is vectorized, while codegen recurses on {S2,S3,S4}.]

Advanced Vectorization Algorithm – cont.

DO I=1,100
  codegen({S2,S3,S4}, 2)
ENDDO
X(1:100)=Y(1:100)+10

[Figure: dependence graph restricted to {S2,S3,S4} after level-1 edges are removed, with δ∞ and δ2 edges; S2 and S3 form a level-2 cycle.]

Advanced Vectorization Algorithm – cont.

DO I=1,100
  DO J=1,100
    codegen({S2,S3}, 3)
  ENDDO
  Y(I+1:I+100)=A(2:101,N)
ENDDO
X(1:100)=Y(1:100)+10

[Figure: once level-2 edges are removed, S4 leaves the cycle and is vectorized at level 2; codegen recurses on {S2,S3}.]

DO I=1,100
  DO J=1,100
    B(J) = A(J,N)
    A(J+1,1:100)=B(J)+C(J,1:100)
  ENDDO
  Y(I+1:I+100)=A(2:101,N)
ENDDO
X(1:100)=Y(1:100)+10

[Figure: at level 3, only the loop-independent edge S2 δ∞ S3 remains, so S2 stays scalar and S3 is vectorized in its innermost dimension.]