Optimizing Compilers for Modern Architectures
Dependence: Theory and Practice
Allen and Kennedy, Chapter 2


Dependence: Theory and Practice
Allen and Kennedy, Chapter 2

Dependence: Theory and Practice
What shall we cover in this chapter?
- Introduction to Dependences
- Loop-carried and Loop-independent Dependences
- Simple Dependence Testing
- Parallelization and Vectorization

The Big Picture
What are our goals?
- Simple goal: make execution time as small as possible
- Which leads to: achieve execution of many (all, in the best case) instructions in parallel
- Which leads to: find independent instructions

Dependences
- We will concentrate on data dependences; Chapter 7 deals with control dependences
- Simple example of data dependence:

  S1  PI = 3.14
  S2  R = 5.0
  S3  AREA = PI * R ** 2

- Statement S3 cannot be moved before either S1 or S2 without compromising correct results

Dependences
Formally: there is a data dependence from statement S1 to statement S2 (S2 depends on S1) if:
1. Both statements access the same memory location and at least one of them stores onto it, and
2. There is a feasible run-time execution path from S1 to S2

Load-Store Classification
Quick review of dependences classified in terms of load-store order:
1. True dependence (RAW hazard): S2 depends on S1, denoted S1 δ S2
2. Antidependence (WAR hazard): S2 depends on S1, denoted S1 δ^-1 S2
3. Output dependence (WAW hazard): S2 depends on S1, denoted S1 δ^0 S2
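As a tiny illustration of all three hazards in straight-line code, here is a sketch of my own (not from the slides), written in Python; the comments mark each dependence:

  x = 1.0         # S1: writes x
  y = x + 2.0     # S2: reads x   -> S1 δ S2 (true dependence, RAW)
  x = 3.0         # S3: writes x  -> S2 δ^-1 S3 (antidependence, WAR)
                  #                  and S1 δ^0 S3 (output dependence, WAW)

No reordering that moves S2 after S3, or S3 before S1, preserves the final values of x and y.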

Dependence in Loops
Let us look at two different loops:

  DO I = 1, N
    S1  A(I+1) = A(I) + B(I)
  ENDDO

  DO I = 1, N
    S1  A(I+2) = A(I) + B(I)
  ENDDO

- In both cases, statement S1 depends on itself
- However, there is a significant difference
- We need a formalism to describe and distinguish such dependences

Iteration Numbers
- The iteration number of a loop is equal to the value of the loop index
- Definition: for an arbitrary loop in which the loop index I runs from L to U in steps of S, the iteration number i of a specific iteration is equal to the index value I on that iteration
- Example:

  DO I = 0, 10, 2
    S1  ...
  ENDDO

  The iteration numbers here are 0, 2, 4, 6, 8, 10

Iteration Vectors
- What do we do for nested loops? We need to consider the nesting level of a loop
- The nesting level of a loop is equal to one more than the number of loops that enclose it
- Given a nest of n loops, the iteration vector i of a particular iteration of the innermost loop is a vector of integers that contains the iteration numbers for each of the loops, in order of nesting level
- Thus, the iteration vector is {i1, i2, ..., in}, where ik, 1 ≤ k ≤ n, represents the iteration number for the loop at nesting level k

Iteration Vectors
Example:

  DO I = 1, 2
    DO J = 1, 2
      S1  ...
    ENDDO
  ENDDO

The iteration vector S1[(2, 1)] denotes the instance of S1 executed during the 2nd iteration of the I loop and the 1st iteration of the J loop

Ordering of Iteration Vectors
Iteration space: the set of all possible iteration vectors for a statement
Example:

  DO I = 1, 2
    DO J = 1, 2
      S1  ...
    ENDDO
  ENDDO

The iteration space for S1 is {(1,1), (1,2), (2,1), (2,2)}
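A minimal sketch (my own, not from the book) that enumerates the iteration space of a rectangular loop nest as Python tuples:

  from itertools import product

  def iteration_space(bounds):
      # bounds: list of (low, high, step) triples, outermost loop first
      return list(product(*(range(lo, hi + 1, st) for lo, hi, st in bounds)))

  # The 2 x 2 nest from the slide: DO I = 1, 2 / DO J = 1, 2
  print(iteration_space([(1, 2, 1), (1, 2, 1)]))
  # -> [(1, 1), (1, 2), (2, 1), (2, 2)]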

Ordering of Iteration Vectors
- Useful to define an ordering for iteration vectors
- Define an intuitive, lexicographic order
- Iteration i precedes iteration j, denoted i < j, iff:
  1. i[1:n-1] < j[1:n-1], or
  2. i[1:n-1] = j[1:n-1] and in < jn
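Python tuples compare lexicographically, which matches this recursive definition exactly; a small sketch (assumed helper name, mine):

  def precedes(i, j):
      # iteration vector i precedes iteration vector j in lexicographic order
      assert len(i) == len(j)
      return tuple(i) < tuple(j)

  print(precedes((1, 2), (2, 1)))  # True: (1, 2) executes before (2, 1)
  print(precedes((2, 1), (1, 2)))  # False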

Formal Definition of Loop Dependence
Theorem 2.1 (Loop Dependence): There exists a dependence from statement S1 to statement S2 in a common nest of loops if and only if there exist two iteration vectors i and j for the nest, such that:
1. i < j, or i = j and there is a path from S1 to S2 in the body of the loop,
2. statement S1 accesses memory location M on iteration i and statement S2 accesses location M on iteration j, and
3. one of these accesses is a write.
Follows from the definition of dependence.

Transformations
- We call a transformation safe if the transformed program has the same "meaning" as the original program
- But what is the "meaning" of a program? For our purposes:
- Two computations are equivalent if, on the same inputs, they produce the same outputs in the same order

Reordering Transformations
A reordering transformation is any program transformation that merely changes the order of execution of the code, without adding or deleting any executions of any statements

Properties of Reordering Transformations
- A reordering transformation does not eliminate dependences
- However, it can reverse the order of the source and sink of a dependence, which can lead to incorrect behavior
- A reordering transformation preserves a dependence if it preserves the relative execution order of the source and sink of that dependence

Fundamental Theorem of Dependence
Fundamental Theorem of Dependence: any reordering transformation that preserves every dependence in a program preserves the meaning of that program
Proof by contradiction; Theorem 2.2 in the book.

Fundamental Theorem of Dependence
A transformation is said to be valid for the program to which it applies if it preserves all dependences in the program.

Distance and Direction Vectors
Consider a dependence in a loop nest of n loops:
- Statement S1 on iteration i is the source of the dependence
- Statement S2 on iteration j is the sink of the dependence
The distance vector d(i,j) is a vector of length n such that d(i,j)k = jk - ik
We shall normalize distance vectors for loops in which the index step size is not equal to 1.

Direction Vectors
Definition 2.10 in the book: suppose that there is a dependence from statement S1 on iteration i of a loop nest of n loops to statement S2 on iteration j; then the dependence direction vector D(i,j) is defined as a vector of length n such that

  D(i,j)k = "<"  if d(i,j)k > 0
            "="  if d(i,j)k = 0
            ">"  if d(i,j)k < 0

Direction Vectors
Example:

  DO I = 1, N
    DO J = 1, M
      DO K = 1, L
        S1  A(I+1, J, K-1) = A(I, J, K) + 10
      ENDDO
    ENDDO
  ENDDO

S1 has a true dependence on itself.
Distance vector: (1, 0, -1)
Direction vector: (<, =, >)
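The distance and direction vectors follow mechanically from the iteration vectors of the source and sink; a small Python sketch of my own, per Definition 2.10:

  def distance_vector(i, j):
      return tuple(jk - ik for ik, jk in zip(i, j))

  def direction_vector(i, j):
      def sign(d):
          return '<' if d > 0 else '=' if d == 0 else '>'
      return tuple(sign(d) for d in distance_vector(i, j))

  # In the example, the write at (I,J,K) = (1,1,2) and the read at (2,1,1)
  # touch the same element A(2,1,1):
  print(distance_vector((1, 1, 2), (2, 1, 1)))   # (1, 0, -1)
  print(direction_vector((1, 1, 2), (2, 1, 1)))  # ('<', '=', '>')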

Direction Vectors
A dependence cannot exist if it has a direction vector whose leftmost non-"=" component is not "<", as this would imply that the sink of the dependence occurs before the source.
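This rule is easy to check mechanically; a sketch with a hypothetical helper name:

  def is_plausible(dv):
      # a direction vector is plausible iff its leftmost non-'=' entry is '<'
      for d in dv:
          if d != '=':
              return d == '<'
      return True  # all '=': a loop-independent dependence

  print(is_plausible(('<', '=', '>')))  # True
  print(is_plausible(('=', '>', '<')))  # False: sink would precede source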

Direction Vector Transformation
Theorem 2.3 (Direction Vector Transformation): Let T be a transformation that is applied to a loop nest and that does not rearrange the statements in the body of the loop. Then the transformation is valid if, after it is applied, none of the direction vectors for dependences with source and sink in the nest has a leftmost non-"=" component that is ">".
Follows from the Fundamental Theorem of Dependence:
- All dependences still exist
- None of the dependences have been reversed

Loop-carried and Loop-independent Dependences
If statement S2 depends on S1 in a loop, there are two possible ways this dependence can occur:
1. S1 and S2 execute on different iterations: this is called a loop-carried dependence
2. S1 and S2 execute on the same iteration: this is called a loop-independent dependence

Loop-carried Dependence
Definition 2.11: Statement S2 has a loop-carried dependence on statement S1 if and only if S1 references location M on iteration i, S2 references M on iteration j, and d(i,j) > 0 (that is, D(i,j) contains a "<" as its leftmost non-"=" component).
Example:

  DO I = 1, N
    S1  A(I+1) = F(I)
    S2  F(I+1) = A(I)
  ENDDO

Loop-carried Dependence
The level of a loop-carried dependence is the index of the leftmost non-"=" component of D(i,j) for the dependence.
For instance:

  DO I = 1, 10
    DO J = 1, 10
      DO K = 1, 10
        S1  A(I, J, K+1) = A(I, J, K)
      ENDDO
    ENDDO
  ENDDO

The direction vector for S1 is (=, =, <)
The level of the dependence is 3
A level-k dependence between S1 and S2 is denoted by S1 δk S2
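The level can likewise be read straight off the direction vector; a one-liner sketch of my own:

  def dependence_level(dv):
      # index (1-based) of the leftmost non-'=' component, None if all '='
      for k, d in enumerate(dv, start=1):
          if d != '=':
              return k
      return None  # loop-independent dependence

  print(dependence_level(('=', '=', '<')))  # 3, matching the example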

Loop-carried Transformations
Theorem 2.4: Any reordering transformation that does not alter the relative order of any loops in the nest and preserves the iteration order of the level-k loop preserves all level-k dependences.
Proof:
- D(i,j) has a "<" in the kth position and "=" in positions 1 through k-1
  => source and sink of the dependence are in the same iteration of loops 1 through k-1
  => a reordering of iterations of those loops cannot change the sense of the dependence
As a result of the theorem, powerful transformations can be applied

Loop-carried Transformations
Example:

  DO I = 1, 10
    S1  A(I+1) = F(I)
    S2  F(I+1) = A(I)
  ENDDO

can be transformed to:

  DO I = 1, 10
    S1  F(I+1) = A(I)
    S2  A(I+1) = F(I)
  ENDDO

Loop-independent Dependences
Definition: Statement S2 has a loop-independent dependence on statement S1 if and only if there exist two iteration vectors i and j such that:
1. Statement S1 refers to memory location M on iteration i, S2 refers to M on iteration j, and i = j
2. There is a control flow path from S1 to S2 within the iteration
Example:

  DO I = 1, 10
    S1  A(I) = ...
    S2  ... = A(I)
  ENDDO

Loop-independent Dependences
More complicated example:

  DO I = 1, 9
    S1  A(I) = ...
    S2  ... = A(10-I)
  ENDDO

(the dependence occurs only on iteration I = 5, where both statements reference A(5))

No common loop is necessary. For instance:

  DO I = 1, 10
    S1  A(I) = ...
  ENDDO
  DO I = 1, 10
    S2  ... = A(20-I)
  ENDDO

(both loops touch A(10): S1 on iteration 10 of the first loop, S2 on iteration 10 of the second)

Loop-independent Dependences
Theorem 2.5: If there is a loop-independent dependence from S1 to S2, any reordering transformation that does not move statement instances between iterations and preserves the relative order of S1 and S2 in the loop body preserves that dependence.
S2 depends on S1 with a loop-independent dependence is denoted by S1 δ∞ S2
Note that the direction vector will have entries that are all "=" for loop-independent dependences

Loop-carried and Loop-independent Dependences
Loop-independent and loop-carried dependences partition all possible data dependences!
Note that if S1 δ S2, then S1 executes before S2. This can happen only if:
- The distance vector for the dependence is greater than 0 (lexicographically), or
- The distance vector equals 0 and S1 occurs before S2 textually
...precisely the criteria for loop-carried and loop-independent dependences.

Simple Dependence Testing
Theorem 2.7: Let α and β be iteration vectors within the iteration space of the following loop nest:

  DO i1 = L1, U1, S1
    DO i2 = L2, U2, S2
      ...
      DO in = Ln, Un, Sn
        S1  A(f1(i1,...,in),...,fm(i1,...,in)) = ...
        S2  ... = A(g1(i1,...,in),...,gm(i1,...,in))
      ENDDO
      ...
    ENDDO
  ENDDO

Simple Dependence Testing
For the loop nest above, a dependence exists from S1 to S2 if and only if there exist values of α and β such that:
1. α is lexicographically less than or equal to β, and
2. the following system of dependence equations is satisfied: fi(α) = gi(β) for all i, 1 ≤ i ≤ m
This is a direct application of the Loop Dependence Theorem.

Simple Dependence Testing: Delta Notation
The notation represents index values at the source and sink of the dependence.
Example:

  DO I = 1, N
    S1  A(I+1) = A(I) + B
  ENDDO

- Iteration at the source denoted by: I0
- Iteration at the sink denoted by: I0 + ΔI
- Forming an equality gets us: I0 + 1 = I0 + ΔI
- Solving this gives us: ΔI = 1
- Carried dependence with distance vector (1) and direction vector (<)
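For 1-D affine subscripts this calculation can be done directly; here is a sketch of my own that solves a*I0 + b = c*(I0 + ΔI) + d for ΔI, under the simplifying assumption that the two coefficients are equal (a == c), as in A(I+1) versus A(I):

  from fractions import Fraction

  def delta(a, b, c, d):
      # distance solving a*I0 + b == c*(I0 + dI) + d, assuming a == c != 0
      if a != c or c == 0:
          return None  # distance depends on I0: needs a real dependence test
      dI = Fraction(b - d, c)
      return int(dI) if dI.denominator == 1 else None  # non-integer: no dep.

  # write A(I+1): a=1, b=1; read A(I): c=1, d=0
  print(delta(1, 1, 1, 0))  # 1 -> distance vector (1), direction vector (<)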

Simple Dependence Testing: Delta Notation
Example:

  DO I = 1, 100
    DO J = 1, 100
      DO K = 1, 100
        A(I+1,J,K) = A(I,J,K+1) + B
      ENDDO
    ENDDO
  ENDDO

Dependence equations: I0 + 1 = I0 + ΔI;  J0 = J0 + ΔJ;  K0 = K0 + ΔK + 1
Solutions: ΔI = 1; ΔJ = 0; ΔK = -1
Corresponding direction vector: (<, =, >)

Simple Dependence Testing: Delta Notation
If a loop index does not appear in the subscripts, its distance is unconstrained and its direction is "*"
Example:

  DO I = 1, 100
    DO J = 1, 100
      A(I+1) = A(I) + B(J)
    ENDDO
  ENDDO

The direction vector for the dependence is (<, *)

Simple Dependence Testing: Delta Notation
"*" denotes the union of all three directions
Example:

  DO J = 1, 100
    DO I = 1, 100
      A(I+1) = A(I) + B(J)
    ENDDO
  ENDDO

The direction vector for the dependence is (*, <) = {(<, <), (=, <), (>, <)}
Note: (>, <) denotes a level-1 antidependence with direction vector (<, >)

Parallelization and Vectorization
Theorem 2.8: It is valid to convert a sequential loop to a parallel loop if the loop carries no dependence.
We want to convert loops like:

  DO I = 1, N
    X(I) = X(I) + C
  ENDDO

to:

  X(1:N) = X(1:N) + C    (Fortran 77 to Fortran 90)

However:

  DO I = 1, N
    X(I+1) = X(I) + C
  ENDDO

is not equivalent to:

  X(2:N+1) = X(1:N) + C
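A quick NumPy check (mine, not the book's) makes the non-equivalence concrete: the vector statement reads all of X(1:N) before any write, while the sequential loop propagates values forward through its carried dependence:

  import numpy as np

  n, c = 5, 1.0
  x_seq = np.zeros(n + 1)
  for i in range(n):              # DO I = 1, N: X(I+1) = X(I) + C
      x_seq[i + 1] = x_seq[i] + c

  x_vec = np.zeros(n + 1)
  x_vec[1:] = x_vec[:-1] + c      # X(2:N+1) = X(1:N) + C, all at once

  print(x_seq)  # [0. 1. 2. 3. 4. 5.]   values propagate
  print(x_vec)  # [0. 1. 1. 1. 1. 1.]   they do not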

Loop Distribution
Can statements in loops which carry dependences be vectorized?

  DO I = 1, N
    S1  A(I+1) = B(I) + C
    S2  D(I) = A(I) + E
  ENDDO

Dependence: S1 δ1 S2
This can be converted to:

  S1  A(2:N+1) = B(1:N) + C
  S2  D(1:N) = A(1:N) + E

Loop Distribution

  DO I = 1, N
    S1  A(I+1) = B(I) + C
    S2  D(I) = A(I) + E
  ENDDO

transformed to:

  DO I = 1, N
    S1  A(I+1) = B(I) + C
  ENDDO
  DO I = 1, N
    S2  D(I) = A(I) + E
  ENDDO

leads to:

  S1  A(2:N+1) = B(1:N) + C
  S2  D(1:N) = A(1:N) + E
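And a matching NumPy check (mine) that distribution is safe here: the carried dependence runs forward from S1 to S2, so running all of S1 first still delivers each A(I) before S2 reads it. C and E are arbitrary constants chosen for the test:

  import numpy as np

  n = 5
  b, C, E = np.ones(n), 3.0, 2.0

  a1, d1 = np.zeros(n + 1), np.zeros(n)
  for i in range(n):         # original fused loop
      a1[i + 1] = b[i] + C   # S1: A(I+1) = B(I) + C
      d1[i] = a1[i] + E      # S2: D(I)  = A(I) + E

  a2 = np.zeros(n + 1)
  a2[1:] = b + C             # distributed, vectorized S1
  d2 = a2[:-1] + E           # distributed, vectorized S2

  print(np.array_equal(a1, a2), np.array_equal(d1, d2))  # True True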

Loop Distribution
Loop distribution fails if there is a cycle of dependences:

  DO I = 1, N
    S1  A(I+1) = B(I) + C
    S2  B(I+1) = A(I) + E
  ENDDO

S1 δ1 S2 and S2 δ1 S1
What about:

  DO I = 1, N
    S1  B(I) = A(I) + E
    S2  A(I+1) = B(I) + C
  ENDDO

Simple Vectorization Algorithm

  procedure vectorize(L, D)
    // L is the maximal loop nest containing the statements.
    // D is the dependence graph for statements in L.
    find the set {S1, S2, ..., Sm} of maximal strongly-connected
      regions in the dependence graph D restricted to L (Tarjan);
    construct Lp from L by reducing each Si to a single node and
      compute Dp, the dependence graph naturally induced on Lp by D;
    let {p1, p2, ..., pm} be the m nodes of Lp numbered in an order
      consistent with Dp (use topological sort);
    for i = 1 to m do begin
      if pi is a dependence cycle then
        generate a DO-loop around the statements in pi;
      else
        directly rewrite pi in Fortran 90, vectorizing it with
          respect to every loop containing it;
    end
  end vectorize
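A compact executable sketch of this procedure, under simplifying assumptions of my own: statements are plain node ids, the dependence graph is a dict of successor lists, and "code generation" just prints what a real vectorizer would emit. SCCs of size greater than one (or with a self-edge) are the dependence cycles that must stay sequential:

  from graphlib import TopologicalSorter

  def sccs(graph):
      # Tarjan's algorithm: maximal strongly-connected regions of graph
      index, low, stack, on_stack, comps, n = {}, {}, [], set(), [], [0]
      def visit(v):
          index[v] = low[v] = n[0]; n[0] += 1
          stack.append(v); on_stack.add(v)
          for w in graph.get(v, ()):
              if w not in index:
                  visit(w); low[v] = min(low[v], low[w])
              elif w in on_stack:
                  low[v] = min(low[v], index[w])
          if low[v] == index[v]:
              comp = set()
              while True:
                  w = stack.pop(); on_stack.discard(w); comp.add(w)
                  if w == v:
                      break
              comps.append(frozenset(comp))
      for v in list(graph):
          if v not in index:
              visit(v)
      return comps

  def vectorize(graph):
      comp_of = {v: c for c in sccs(graph) for v in c}
      preds = {c: set() for c in comp_of.values()}
      for v, succs in graph.items():        # induce the graph on the regions
          for w in succs:
              if comp_of[v] != comp_of[w]:
                  preds[comp_of[w]].add(comp_of[v])
      for p in TopologicalSorter(preds).static_order():
          cyclic = len(p) > 1 or any(v in graph.get(v, ()) for v in p)
          if cyclic:
              print("sequential DO-loop around", sorted(p))
          else:
              print("vector (Fortran 90) statement for", sorted(p))

  vectorize({'S1': ['S2'], 'S2': ['S1']})  # cycle: stays a DO-loop
  vectorize({'S1': ['S2'], 'S2': []})      # acyclic: two vector statements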

Problems With Simple Vectorization

  DO I = 1, N
    DO J = 1, M
      S1  A(I+1,J) = A(I,J) + B
    ENDDO
  ENDDO

- Dependence from S1 to itself with d(i, j) = (1, 0)
- Key observation: since the dependence is at level 1, we can manipulate the other loop!
- Can be converted to:

  DO I = 1, N
    S1  A(I+1,1:M) = A(I,1:M) + B
  ENDDO

- The simple algorithm does not capitalize on such opportunities

Advanced Vectorization Algorithm

  procedure codegen(R, k, D)
    // R is the region for which we must generate code.
    // k is the minimum nesting level of possible parallel loops.
    // D is the dependence graph among statements in R.
    find the set {S1, S2, ..., Sm} of maximal strongly-connected
      regions in the dependence graph D restricted to R;
    construct Rp from R by reducing each Si to a single node and
      compute Dp, the dependence graph naturally induced on Rp by D;
    let {p1, p2, ..., pm} be the m nodes of Rp numbered in an order
      consistent with Dp (use topological sort to do the numbering);
    for i = 1 to m do begin
      if pi is cyclic then begin
        generate a level-k DO statement;
        let Di be the dependence graph consisting of all dependence
          edges in D that are at level k+1 or greater and are
          internal to pi;
        codegen(pi, k+1, Di);
        generate the level-k ENDDO statement;
      end
      else
        generate a vector statement for pi in r(pi)-k+1 dimensions,
          where r(pi) is the number of loops containing pi;
    end

Advanced Vectorization Algorithm
Example:

  DO I = 1, 100
    S1  X(I) = Y(I) + 10
    DO J = 1, 100
      S2  B(J) = A(J,N)
      DO K = 1, 100
        S3  A(J+1,K) = B(J) + C(J,K)
      ENDDO
      S4  Y(I+J) = A(J+1, N)
    ENDDO
  ENDDO

Advanced Vectorization Algorithm
Applying the simple dependence testing procedure to the example above:
- True dependence from S4 to S1 (S4 writes Y(I+J), S1 reads Y(I))
- I0 + J = I0 + ΔI, so ΔI = J
- As J is always positive, the direction is "<"

Advanced Vectorization Algorithm
- S2 and S3: dependence via B(J)
- I does not occur in either subscript, so its distance is unconstrained (direction "*")
- We get: J0 = J0 + ΔJ, so ΔJ = 0
- Direction vectors: (*, =)

Advanced Vectorization Algorithm
codegen is called at the outermost level (level 1):
- S1 is an acyclic node by itself, so S1 will be vectorized
- The remaining statements form a cycle at level 1, so a level-1 DO loop is generated around them:

  DO I = 1, 100
    codegen({S2, S3, S4}, 2)
  ENDDO
  X(1:100) = Y(1:100) + 10

Advanced Vectorization Algorithm
codegen({S2, S3, S4}, 2): the level-1 dependences are stripped off
With those edges gone, the cycle breaks: {S2, S3} remain cyclic at level 2, while S4 becomes acyclic and is vectorized along J:

  DO I = 1, 100
    DO J = 1, 100
      codegen({S2, S3}, 3)
    ENDDO
    S4  Y(I+1:I+100) = A(2:101,N)
  ENDDO
  X(1:100) = Y(1:100) + 10

Advanced Vectorization Algorithm
codegen({S2, S3}, 3): the level-2 dependences are stripped off
S2 and S3 are no longer cyclic, so S3 is vectorized along K, giving the final code:

  DO I = 1, 100
    DO J = 1, 100
      B(J) = A(J,N)
      A(J+1,1:100) = B(J) + C(J,1:100)
    ENDDO
    Y(I+1:I+100) = A(2:101,N)
  ENDDO
  X(1:100) = Y(1:100) + 10