Enhancing Fine-Grained Parallelism: Chapter 5 of Allen and Kennedy, Optimizing Compilers for Modern Architectures

Fine-Grained Parallelism
Techniques to enhance fine-grained parallelism:
— Loop Interchange
— Scalar Expansion
— Scalar Renaming
— Array Renaming

Prelude: A Long Time Ago...

    procedure codegen(R, k, D);
      // R is the region for which we must generate code.
      // k is the minimum nesting level of possible parallel loops.
      // D is the dependence graph among statements in R.
      find the set {S_1, S_2, ..., S_m} of maximal strongly-connected regions in the dependence graph D restricted to R;
      construct R_p from R by reducing each S_i to a single node, and compute D_p, the dependence graph naturally induced on R_p by D;
      let {p_1, p_2, ..., p_m} be the m nodes of R_p numbered in an order consistent with D_p (use topological sort to do the numbering);
      for i = 1 to m do begin
        if p_i is cyclic then begin                  // <-- "We fail here"
          generate a level-k DO statement;
          let D_i be the dependence graph consisting of all dependence edges in D that are at level k+1 or greater and are internal to p_i;
          codegen(p_i, k+1, D_i);
          generate the level-k ENDDO statement;
        end
        else
          generate a vector statement for p_i in r(p_i)-k+1 dimensions, where r(p_i) is the number of loops containing p_i;
      end

Prelude: A Long Time Ago...
— codegen tries to find parallelism using the transformations of loop distribution and statement reordering.
— If we deal with loops containing cyclic dependences early on in the loop nest, we can potentially vectorize more loops.
— Goal in Chapter 5: explore other transformations to exploit parallelism.

Motivational Example

    DO J = 1, M
      DO I = 1, N
        T = 0.0
        DO K = 1, L
          T = T + A(I,K) * B(K,J)
        ENDDO
        C(I,J) = T
      ENDDO
    ENDDO

codegen will not uncover any vector operations here. However, by scalar expansion we can get:

    DO J = 1, M
      DO I = 1, N
        T$(I) = 0.0
        DO K = 1, L
          T$(I) = T$(I) + A(I,K) * B(K,J)
        ENDDO
        C(I,J) = T$(I)
      ENDDO
    ENDDO

Motivational Example II
Loop distribution gives us:

    DO J = 1, M
      DO I = 1, N
        T$(I) = 0.0
      ENDDO
      DO I = 1, N
        DO K = 1, L
          T$(I) = T$(I) + A(I,K) * B(K,J)
        ENDDO
      ENDDO
      DO I = 1, N
        C(I,J) = T$(I)
      ENDDO
    ENDDO

Motivational Example III
Finally, interchanging the I and K loops, we get:

    DO J = 1, M
      T$(1:N) = 0.0
      DO K = 1, L
        T$(1:N) = T$(1:N) + A(1:N,K) * B(K,J)
      ENDDO
      C(1:N,J) = T$(1:N)
    ENDDO

A couple of new transformations were used:
— loop interchange
— scalar expansion
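
To make the net effect concrete, here is a minimal NumPy sketch (mine, not from the book): it checks that the original scalar nest and the fully transformed vector form compute the same thing, namely the matrix product C = A*B.

    import numpy as np

    N, M, L = 4, 3, 5
    rng = np.random.default_rng(0)
    A = rng.random((N, L))
    B = rng.random((L, M))

    # Original form: scalar temporary T, recurrence in the innermost K loop.
    C1 = np.empty((N, M))
    for j in range(M):
        for i in range(N):
            t = 0.0
            for k in range(L):
                t = t + A[i, k] * B[k, j]
            C1[i, j] = t

    # Transformed form: T expanded to T$(1:N), loops distributed,
    # I and K interchanged, I vectorized (whole-column operations).
    C2 = np.empty((N, M))
    for j in range(M):
        t = np.zeros(N)                    # T$(1:N) = 0.0
        for k in range(L):
            t = t + A[:, k] * B[k, j]      # T$(1:N) = T$(1:N) + A(1:N,K)*B(K,J)
        C2[:, j] = t                       # C(1:N,J) = T$(1:N)

    assert np.allclose(C1, C2)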

Loop Interchange

    DO I = 1, N
      DO J = 1, M
S       A(I,J+1) = A(I,J) + B      ! DV: (=, <)
      ENDDO
    ENDDO

Applying loop interchange:

    DO J = 1, M
      DO I = 1, N
S       A(I,J+1) = A(I,J) + B      ! DV: (<, =)
      ENDDO
    ENDDO

leads to:

    DO J = 1, M
S     A(1:N,J+1) = A(1:N,J) + B
    ENDDO

Loop Interchange
Loop interchange is a reordering transformation. Why?
— Think of each statement instance as parameterized by its iteration vector.
— Loop interchange merely changes the execution order of these instances.
— It does not create new instances or delete existing ones.

    DO J = 1, M
      DO I = 1, N
S       ...
      ENDDO
    ENDDO

If interchanged, S(2, 1) will execute before S(1, 2).
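
A tiny sketch (illustrative, not from the slides) of the reordering claim: interchange permutes the execution order of the statement instances S(j, i) but preserves the set of instances.

    M, N = 2, 2

    original = [(j, i) for j in range(1, M + 1) for i in range(1, N + 1)]
    interchanged = [(j, i) for i in range(1, N + 1) for j in range(1, M + 1)]

    print(original)      # [(1, 1), (1, 2), (2, 1), (2, 2)]
    print(interchanged)  # [(1, 1), (2, 1), (1, 2), (2, 2)]: S(2,1) now before S(1,2)
    assert set(original) == set(interchanged)   # no instance created or deleted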

Loop Interchange: Safety
Not all loop interchanges are safe:

    DO J = 1, M
      DO I = 1, N
        A(I,J+1) = A(I+1,J) + B
      ENDDO
    ENDDO

The direction vector is (<, >). If we interchange the loops, it becomes (>, <), and we violate the dependence.

Loop Interchange: Safety
A dependence is interchange-preventing with respect to a given pair of loops if interchanging those loops would reorder the endpoints of the dependence.

Loop Interchange: Safety
A dependence is interchange-sensitive if it is carried by the same loop after interchange. That is, an interchange-sensitive dependence moves with its original carrier loop to the new level.

Loop Interchange: Safety
Theorem 5.1: Let D(i, j) be a direction vector for a dependence in a perfect nest of n loops. Then the direction vector for the same dependence after a permutation of the loops in the nest is determined by applying the same permutation to the elements of D(i, j).

The direction matrix for a nest of loops is a matrix in which each row is a direction vector for some dependence between statements contained in the nest, and every such direction vector is represented by a row.

Loop Interchange: Safety

    DO I = 1, N
      DO J = 1, M
        DO K = 1, L
          A(I+1,J+1,K) = A(I,J,K) + A(I,J+1,K+1)
        ENDDO
      ENDDO
    ENDDO

The direction matrix for the loop nest is:

    < < =
    < = >

Theorem 5.2: A permutation of the loops in a perfect nest is legal if and only if the direction matrix, after the same permutation is applied to its columns, has no ">" direction as the leftmost non-"=" direction in any row. (Follows from Theorem 5.1 and Theorem 2.3.)
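
Theorem 5.2 gives a short mechanical test. A hedged sketch (encoding directions as the strings '<', '=', '>' is my own choice):

    def permutation_is_legal(direction_matrix, perm):
        """Theorem 5.2: legal iff no row of the column-permuted matrix
        has '>' as its leftmost non-'=' direction."""
        for row in direction_matrix:
            permuted = [row[p] for p in perm]
            for d in permuted:
                if d == '=':
                    continue
                if d == '>':
                    return False   # leftmost non-'=' is '>': illegal
                break              # leftmost non-'=' is '<': row is fine
        return True

    # Direction matrix of the (I, J, K) nest above.
    D = [['<', '<', '='],
         ['<', '=', '>']]

    print(permutation_is_legal(D, [0, 1, 2]))  # True:  original order I, J, K
    print(permutation_is_legal(D, [1, 0, 2]))  # True:  J, I, K is also legal
    print(permutation_is_legal(D, [2, 0, 1]))  # False: K, I, J exposes a '>' first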

Loop Interchange: Profitability
Profitability depends on the target architecture.

    DO I = 1, N
      DO J = 1, M
        DO K = 1, L
S         A(I+1,J+1,K) = A(I,J,K) + B
        ENDDO
      ENDDO
    ENDDO

For SIMD machines with a large number of functional units:

    DO I = 1, N
S     A(I+1,2:M+1,1:L) = A(I,1:M,1:L) + B
    ENDDO

This form is not suitable for vector register machines.

Loop Interchange: Profitability
For vector machines, we want to vectorize loops with stride-one memory access. Since Fortran stores arrays in column-major order, it is useful to vectorize the I-loop. Thus, transform to:

    DO J = 1, M
      DO K = 1, L
S       A(2:N+1,J+1,K) = A(1:N,J,K) + B
      ENDDO
    ENDDO

Loop Interchange: Profitability
On MIMD machines with vector execution units, we want to cut down synchronization costs. Hence, shift the K-loop to the outermost level:

    PARALLEL DO K = 1, L
      DO J = 1, M
        A(2:N+1,J+1,K) = A(1:N,J,K) + B
      ENDDO
    END PARALLEL DO

Loop Shifting
Motivation: identify loops which can be moved, and move them to "optimal" nesting levels.

Theorem 5.3: In a perfect loop nest, if the loops at levels i, i+1, ..., i+n carry no dependence, it is always legal to shift these loops inside of loop i+n+1. Furthermore, these loops will not carry any dependences in their new position.

Proof:
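
A sketch of the Theorem 5.3 precondition (the helper name and encoding are mine): a loop carries no dependence when no row of the direction matrix has its leftmost non-'=' entry at that loop's level.

    def carries_dependence(direction_matrix, level):
        for row in direction_matrix:
            for lvl, d in enumerate(row, start=1):
                if d == '=':
                    continue
                if lvl == level and d == '<':
                    return True    # this row is carried at the given level
                break              # carried elsewhere (or loop-independent)
        return False

    # The matrix-multiply nest on the next slide: the true, anti, and
    # output dependences of S on itself all have direction vector (=, =, <),
    # so only the K loop carries a dependence.
    D = [['=', '=', '<']] * 3
    print(carries_dependence(D, 1), carries_dependence(D, 2))  # False False
    print(carries_dependence(D, 3))                            # True
    # By Theorem 5.3, I and J may be shifted inside K.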

Loop Shifting

    DO I = 1, N
      DO J = 1, N
        DO K = 1, N
S         A(I,J) = A(I,J) + B(I,K)*C(K,J)
        ENDDO
      ENDDO
    ENDDO

S has true, anti, and output dependences on itself, so codegen fails: the recurrence sits at the innermost level. Use loop shifting to move the K-loop to the outermost level:

    DO K = 1, N
      DO I = 1, N
        DO J = 1, N
S         A(I,J) = A(I,J) + B(I,K)*C(K,J)
        ENDDO
      ENDDO
    ENDDO

Loop Shifting
codegen vectorizes the shifted nest to:

    DO K = 1, N
      FORALL J = 1, N
        A(1:N,J) = A(1:N,J) + B(1:N,K)*C(K,J)
      END FORALL
    ENDDO

Loop Shifting
Change the body of codegen:

    if p_i is cyclic then
      if k is the deepest loop in p_i then
        try_recurrence_breaking(p_i, D, k)
      else begin
        select_loop_and_interchange(p_i, D, k);
        generate a level-k DO statement;
        let D_i be the dependence graph consisting of all dependence edges in D that are at level k+1 or greater and are internal to p_i;
        codegen(p_i, k+1, D_i);
        generate the level-k ENDDO statement;
      end

Loop Shifting

    procedure select_loop_and_interchange(p_i, D, k)
      if the outermost carried dependence in p_i is at level p > k then
        shift the loops at levels k, k+1, ..., p-1 inside the level-p loop, making it the level-k loop;
      return;
    end select_loop_and_interchange

Loop Selection
Consider:

    DO I = 1, N
      DO J = 1, M
S       A(I+1,J+1) = A(I,J) + A(I+1,J)
      ENDDO
    ENDDO

Direction matrix:

    < <
    = <

The loop-shifting algorithm will fail to uncover vector loops here; however, interchanging the loops leads to:

    DO J = 1, M
      A(2:N+1,J+1) = A(1:N,J) + A(2:N+1,J)
    ENDDO

We need a more general algorithm.

Loop Selection
Loop selection: select a loop at nesting level p ≥ k that can be safely moved outward to level k, and shift the loops at levels k, k+1, ..., p-1 inside it.

Loop Selection
Heuristics for selecting the loop level:
— If the level-k loop carries no dependence, let p be the smallest integer such that the level-p loop carries a dependence (the loop-shifting heuristic).
— If the level-k loop carries a dependence, let p be the outermost loop that can be safely shifted outward to position k and that carries a dependence d whose direction vector contains an "=" in every position but the p-th, that is, a vector of the form (=, ..., =, <, =, ..., =) with the "<" in position p. If no such loop exists, let p = k.
A sketch of these heuristics appears below.
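
A hedged sketch of the two heuristics (the function names are mine, and the Theorem 5.2 safety test for shifting the chosen loop outward is omitted for brevity):

    def carried_level(row):
        """1-based level of the loop carrying this dependence, else None."""
        for level, d in enumerate(row, start=1):
            if d != '=':
                return level if d == '<' else None
        return None

    def select_level(direction_matrix, k):
        carriers = {carried_level(r) for r in direction_matrix} - {None}
        if k not in carriers:
            # Loop-shifting heuristic: smallest level carrying a dependence.
            outer = [c for c in carriers if c > k]
            return min(outer) if outer else k
        # Otherwise: outermost loop carrying a dependence whose direction
        # vector is '=' everywhere except at its own position.
        for p in sorted(carriers):
            for row in direction_matrix:
                if row[p - 1] == '<' and all(
                        d == '=' for i, d in enumerate(row, start=1) if i != p):
                    return p
        return k

    # The example above: rows (<, <) and (=, <); with k = 1 the heuristic
    # selects p = 2, i.e. move the J loop outward (interchange).
    print(select_level([['<', '<'], ['=', '<']], 1))   # 2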

Scalar Expansion

    DO I = 1, N
S1    T = A(I)
S2    A(I) = B(I)
S3    B(I) = T
    ENDDO

Scalar expansion:

    DO I = 1, N
S1    T$(I) = A(I)
S2    A(I) = B(I)
S3    B(I) = T$(I)
    ENDDO
    T = T$(N)

This leads to:

    S1  T$(1:N) = A(1:N)
    S2  A(1:N) = B(1:N)
    S3  B(1:N) = T$(1:N)
        T = T$(N)

Scalar Expansion
However, scalar expansion is not always profitable. Consider:

    DO I = 1, N
      T = T + A(I) + A(I+1)
      A(I) = T
    ENDDO

Scalar expansion gives us:

    T$(0) = T
    DO I = 1, N
S1    T$(I) = T$(I-1) + A(I) + A(I+1)
S2    A(I) = T$(I)
    ENDDO
    T = T$(N)

Here the dependence on T carries a value from one iteration to the next, so the recurrence survives expansion and nothing is gained.

Scalar Expansion: Safety
Scalar expansion is always safe. When is it profitable?
— Naïve approach: expand all scalars, vectorize, then shrink all unnecessary expansions.
— However, we want to predict when expansion is profitable.
Dependences due to reuse of a memory location vs. reuse of values:
— Dependences due to reuse of values must be preserved.
— Dependences due to reuse of a memory location can be deleted by expansion.

Scalar Expansion: Covering Definitions
A definition X of a scalar S is a covering definition for loop L if a definition of S placed at the beginning of L reaches no uses of S that occur past X.

    DO I = 1, 100
S1    T = X(I)                 ! covering definition
S2    Y(I) = T
    ENDDO

    DO I = 1, 100
      IF (A(I) .GT. 0) THEN
S1      T = X(I)               ! covering definition
S2      Y(I) = T
      ENDIF
    ENDDO

Scalar Expansion: Covering Definitions
A covering definition does not always exist:

    DO I = 1, 100
      IF (A(I) .GT. 0) THEN
S1      T = X(I)
      ENDIF
S2    Y(I) = T
    ENDDO

In SSA terms: there is no covering definition for a variable T if the SSA edge out of the first assignment to T goes to a φ-function later in the loop that merges its value with values from another control-flow path through the loop.

Scalar Expansion: Covering Definitions
We will consider a collection of covering definitions. There is a collection C of covering definitions for T in a loop if either:
— there exists no φ-function at the beginning of the loop that merges versions of T from outside the loop with versions defined in the loop, or
— the φ-function within the loop has no SSA edge to any φ-function, including itself.

Scalar Expansion: Covering Definitions
Remember the loop which had no covering definition:

    DO I = 1, 100
      IF (A(I) .GT. 0) THEN
S1      T = X(I)
      ENDIF
S2    Y(I) = T
    ENDDO

To form a collection of covering definitions, we can insert dummy assignments:

    DO I = 1, 100
      IF (A(I) .GT. 0) THEN
S1      T = X(I)
      ELSE
S2      T = T
      ENDIF
S3    Y(I) = T
    ENDDO

Scalar Expansion: Covering Definitions
Algorithm to insert dummy assignments and compute the collection C of covering definitions:
— Central idea: look for parallel paths to a φ-function following the first assignment, until no more exist.

Scalar Expansion: Covering Definitions
Detailed algorithm:
— Let S0 be the φ-function for T at the beginning of the loop, if there is one, and null otherwise. Make C empty and initialize an empty stack.
— Let S1 be the first definition of T in the loop. Add S1 to C.
— If the SSA successor of S1 is a φ-function S2 that is not equal to S0, push S2 onto the stack and mark it.
— While the stack is non-empty:
  — pop the φ-function S from the stack;
  — add all SSA predecessors of S that are not φ-functions to C;
  — if there is an SSA edge from S0 into S, insert the assignment T = T as the last statement along that edge and add it to C;
  — for each unmarked φ-function S3 (other than S0) that is an SSA predecessor of S, mark S3 and push it onto the stack;
  — for each unmarked φ-function S4 that can be reached from S by a single SSA edge and that is not predominated by S in the control flow graph, mark S4 and push it onto the stack.

Scalar Expansion: Covering Definitions
Given the collection of covering definitions, we can carry out scalar expansion for a normalized loop:
— Create an array T$ of appropriate length.
— For each S in the covering-definition collection C, replace the T on the left-hand side by T$(I).
— For every other definition of T, and every use of T in the loop body reachable by SSA edges that do not pass through S0 (the φ-function at the beginning of the loop), replace T by T$(I).
— For every use prior to a covering definition (direct successors of S0 in the SSA graph), replace T by T$(I-1).
— If S0 is not null, insert T$(0) = T before the loop.
— If there is an SSA edge from any definition in the loop to a use outside the loop, insert T = T$(U) after the loop, where U is the loop upper bound.

Scalar Expansion: Covering Definitions
Original loop:

    DO I = 1, 100
      IF (A(I) .GT. 0) THEN
S1      T = X(I)
      ENDIF
S2    Y(I) = T
    ENDDO

After inserting covering definitions:

    DO I = 1, 100
      IF (A(I) .GT. 0) THEN
S1      T = X(I)
      ELSE
S2      T = T
      ENDIF
S3    Y(I) = T
    ENDDO

After scalar expansion:

    T$(0) = T
    DO I = 1, 100
      IF (A(I) .GT. 0) THEN
S1      T$(I) = X(I)
      ELSE
        T$(I) = T$(I-1)
      ENDIF
S2    Y(I) = T$(I)
    ENDDO
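
A NumPy check (mine, not from the book) that the expanded loop preserves the original loop's semantics, including the final value of T:

    import numpy as np

    rng = np.random.default_rng(1)
    A, X = rng.standard_normal(100), rng.standard_normal(100)

    # Original loop: T may or may not be assigned in an iteration.
    T = 0.0
    Y1 = np.empty(100)
    for i in range(100):
        if A[i] > 0:
            T = X[i]
        Y1[i] = T

    # Expanded loop: T$(0) = T, then every iteration defines T$(I).
    Ts = np.empty(101)
    Ts[0] = 0.0
    Y2 = np.empty(100)
    for i in range(1, 101):
        Ts[i] = X[i - 1] if A[i - 1] > 0 else Ts[i - 1]   # dummy arm: T$(I) = T$(I-1)
        Y2[i - 1] = Ts[i]

    assert np.allclose(Y1, Y2) and T == Ts[100]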

Deletable Dependences
Uses of T before covering definitions are expanded as T$(I-1); all other uses are expanded as T$(I). The deletable dependences are then:
— backward carried antidependences
— backward carried output dependences
— forward carried output dependences
— loop-independent antidependences into the covering definition
— loop-carried true dependences from a covering definition
A small classifier sketch follows.
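
A classifier sketch of the five cases (the Edge encoding is my own; a real compiler would read these facts off the dependence graph):

    from dataclasses import dataclass

    @dataclass
    class Edge:
        kind: str                    # 'true' | 'anti' | 'output'
        carried: bool                # loop-carried?
        backward: bool               # sink lexically precedes source?
        into_covering: bool = False  # loop-independent edge into a covering def
        from_covering: bool = False  # carried edge out of a covering def

    def deletable(e: Edge) -> bool:
        if e.kind == 'anti' and e.carried and e.backward:
            return True    # backward carried antidependence
        if e.kind == 'output' and e.carried:
            return True    # backward or forward carried output dependence
        if e.kind == 'anti' and not e.carried and e.into_covering:
            return True    # loop-independent anti into a covering definition
        if e.kind == 'true' and e.carried and e.from_covering:
            return True    # loop-carried true dependence from a covering def
        return False

    print(deletable(Edge('anti', carried=True, backward=True)))    # True
    print(deletable(Edge('true', carried=True, backward=False)))   # False: value flow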

Scalar Expansion

    procedure try_recurrence_breaking(p_i, D, k)
      if k is the deepest loop in p_i then begin
        remove deletable edges in p_i;
        find the set {SC_1, SC_2, ..., SC_n} of maximal strongly-connected regions in D restricted to p_i;
        if there are vector statements among the SC_i then
          expand scalars indicated by deletable edges;
        codegen(p_i, k, D restricted to p_i);
      end
    end try_recurrence_breaking

Scalar Expansion: Drawbacks
Expansion increases memory requirements. Solutions:
— Expand in a single loop only.
— Strip mine the loop before expansion (see the sketch below).
— Forward substitution:

    DO I = 1, N
      T = A(I) + A(I+1)
      A(I) = T + B(I)
    ENDDO

becomes

    DO I = 1, N
      A(I) = A(I) + A(I+1) + B(I)
    ENDDO
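
A sketch of the strip-mining option (the strip length and names are mine): T is expanded only to the strip length, so the extra storage is O(strip) rather than O(N).

    import numpy as np

    def scalar_version(a, b):
        a = a.copy()
        for i in range(len(a) - 1):
            t = a[i] + a[i + 1]               # T = A(I) + A(I+1)
            a[i] = t + b[i]                   # A(I) = T + B(I)
        return a

    def stripmined(a, b, strip=64):
        a = a.copy()
        n = len(a) - 1
        for lo in range(0, n, strip):
            hi = min(lo + strip, n)
            t = a[lo:hi] + a[lo + 1:hi + 1]   # T$ holds one strip, not all N
            a[lo:hi] = t + b[lo:hi]
        return a

    rng = np.random.default_rng(2)
    a, b = rng.random(1000), rng.random(999)
    assert np.allclose(scalar_version(a, b), stripmined(a, b))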

Scalar Renaming

    DO I = 1, 100
S1    T = A(I) + B(I)
S2    C(I) = T + T
S3    T = D(I) - B(I)
S4    A(I+1) = T * T
    ENDDO

Renaming scalar T:

    DO I = 1, 100
S1    T1 = A(I) + B(I)
S2    C(I) = T1 + T1
S3    T2 = D(I) - B(I)
S4    A(I+1) = T2 * T2
    ENDDO

Scalar Renaming
This will lead to:

    S3  T2$(1:100) = D(1:100) - B(1:100)
    S4  A(2:101) = T2$(1:100) * T2$(1:100)
    S1  T1$(1:100) = A(1:100) + B(1:100)
    S2  C(1:100) = T1$(1:100) + T1$(1:100)
        T = T2$(100)

Scalar Renaming
The renaming algorithm partitions all definitions and uses into equivalence classes, each of which can occupy a different memory location. Using the definition-use graph:
— pick a definition;
— add all uses that the definition reaches to the equivalence class;
— add all definitions that reach any of those uses;
— repeat until a fixed point is reached.
A sketch of this partition appears below.
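
A sketch of that fixed-point partition (my own encoding of the definition-use information):

    def partition_webs(defs, reaches):
        """reaches: set of (d, u) pairs, meaning definition d reaches use u."""
        webs = []
        unplaced = set(defs)
        while unplaced:
            web_defs, web_uses = {unplaced.pop()}, set()
            changed = True
            while changed:                    # iterate to a fixed point
                changed = False
                for d, u in reaches:
                    if d in web_defs and u not in web_uses:
                        web_uses.add(u); changed = True
                    if u in web_uses and d not in web_defs:
                        web_defs.add(d); changed = True
            unplaced -= web_defs
            webs.append((web_defs, web_uses))
        return webs

    # The loop above: S1 reaches only the uses in S2, S3 only those in S4,
    # so T splits into two classes and can be renamed T1 / T2.
    print(partition_webs({'S1', 'S3'}, {('S1', 'S2'), ('S3', 'S4')}))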

Scalar Renaming: Profitability
— Scalar renaming will break recurrences in which a loop-independent output dependence or antidependence is a critical element of a cycle.
— Renaming is relatively cheap to apply.
— It is usually done by compilers when calculating live ranges for register allocation.

Array Renaming

    DO I = 1, N
S1    A(I) = A(I-1) + X
S2    Y(I) = A(I) + Z
S3    A(I) = B(I) + C
    ENDDO

Dependences: S1 δ S2 (loop-independent true), S2 δ-1 S3 (loop-independent anti), S3 δ1 S1 (loop-carried true), S1 δ0 S3 (loop-independent output).

Rename A(I) to A$(I) in S1 and S2:

    DO I = 1, N
S1    A$(I) = A(I-1) + X
S2    Y(I) = A$(I) + Z
S3    A(I) = B(I) + C
    ENDDO

Dependences remaining: S1 δ S2 and S3 δ1 S1.

Array Renaming: Profitability
— Examining the dependence graph to determine a minimum set of critical edges whose removal breaks a recurrence is NP-complete!
— Solution: determine the edges that array renaming removes, and analyze their effect on the dependence graph.
— procedure array_partition:
  — assumes no control flow in the loop body;
  — identifies collections of references to arrays which refer to the same value;
  — identifies deletable output dependences and antidependences.
— Use this procedure to generate code, minimizing the amount of copying back to the "original" array at the beginning and the end of the loop.
A cycle-check sketch follows.
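
As a closing sketch (an illustrative encoding, not the book's array_partition), we can check that deleting the renaming-removable edges really breaks the recurrence: an edge u -> v lies on a cycle exactly when v can still reach u.

    def reaches(graph, src, dst):
        seen, stack = set(), [src]
        while stack:
            n = stack.pop()
            if n == dst:
                return True
            if n not in seen:
                seen.add(n)
                stack.extend(graph.get(n, []))
        return False

    # Dependence graph of the example: before and after renaming.
    before = {'S1': ['S2', 'S3'], 'S2': ['S3'], 'S3': ['S1']}
    after = {'S1': ['S2'], 'S3': ['S1']}   # renaming deleted S2->S3 and S1->S3

    print(reaches(before, 'S3', 'S1'))     # True: S3 -> S1 closes a cycle
    print(any(reaches(after, v, u)
              for u, vs in after.items() for v in vs))   # False: now acyclic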