Dependence Analysis and Loops CS 3220 Spring 2016

Loop Examples  Loop Permutation for improved locality do j = 1,6 do i = 1,5 do i = 1,5 A(j,i) = A(j,i)+1 A(j,i) = A(j,i)+1 enddo enddoenddo do i = 1,5 do j = 1,6 do j = 1,6 A(j,i) = A(j,i)+1 A(j,i) = A(j,i)+1 enddo enddoenddo

Loop Examples  Parallelization do i = 1,100 A(i) = A(i)+1 A(i) = A(i)+1enddo do i = 1,100 A(i) = A(i-1)+1 A(i) = A(i-1)+1enddo

Data Dependences and Loops
 How do we identify dependences in loops?
 Simple view
  – Imagine that all loops are fully unrolled
  – Examine data dependences as before
 Problems
  – Impractical
  – We lose the loop structure

  do i = 1,5
    A(i) = A(i-1)+1
  enddo

Data Dependence
 Definition: Data dependences are constraints on the order in which statements may be executed
 Types of dependences
  – Flow (true) dependence: s1 writes memory that s2 later reads (RAW)
  – Anti-dependence: s1 reads memory that s2 later writes (WAR)
  – Output dependence: s1 writes memory that s2 later writes (WAW)
  – Input dependence: s1 reads memory that s2 later reads (RAR)
 Notation: s1 δ s2
  – s1 is called the source of the dependence
  – s2 is called the sink or target
  – s1 must be executed before s2

Example

  s1: a = b;
  s2: b = c + d;
  s3: e = a + d;
  s4: b = 3;
  s5: f = b * 2;
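For reference (worked answer, not spelled out on the original slide): s1 δ s3 is a flow dependence on a; s1 δ s2 is an anti-dependence on b; s2 δ s4 is an output dependence on b; and s4 δ s5 is a flow dependence on b.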

Dependences and Loops
 Loop-independent dependences: dependences within the same loop iteration

  do i = 1,100
    A(i) = B(i) + 1
    C(i) = A(i) * 2
  enddo

 Loop-carried dependences: dependences that cross loop iterations

  do i = 1,100
    A(i) = B(i) + 1
    C(i) = A(i-1) * 2
  enddo

Concepts
 Iteration space
  – A set of tuples that represent the iterations of a loop
  – We can visualize the dependences in an iteration space

  do j = 1,6
    do i = 1,5
      A(j,i) = A(j-1,i-1)+1
    enddo
  enddo

Protein String Matching Example

Distance Vectors
 Idea
  – Concisely describe dependence relationships between iterations of an iteration space
  – For each dimension of an iteration space, the distance is the number of iterations between accesses to the same memory location
 Definition
  – v = iT – iS (target – source)

  do i = 1,6
    do j = 1,5
      A(i,j) = A(i-1,j-2)+1
    enddo
  enddo

  Distance Vector: (1,2)
  (The value read as A(i-1,j-2) at iteration (i,j) was written at iteration (i-1,j-2), so v = (i,j) – (i-1,j-2) = (1,2).)

More Examples
 Sample code

  do i = 1,6
    do j = 1,5
      A(i,j) = A(i-1,j+1)+1
    enddo
  enddo

  Distance vector: ?

Distance Vectors and Loop Transformations
 Any transformation we perform on the loop must respect the dependences
 Example: Can we permute the i and j loops?

  do i = 1,6
    do j = 1,5
      A(i,j) = A(i-1,j-2)+1
    enddo
  enddo

  do j = 1,5
    do i = 1,6
      A(i,j) = A(i-1,j-2)+1
    enddo
  enddo

Exercise

  do i = 1,6
    do j = 1,5
      A(i,j) = A(i-1,j+1)+1
    enddo
  enddo

 Iteration space?
 Distance vector?
 What if we exchange the order of the i and j loops?
 What kinds of dependences are there?

Direction Vector
 Definition
  – A direction vector serves the same purpose as a distance vector when less precision is required or available
  – Element i of a direction vector is <, >, or = based on whether the source of the dependence precedes, follows, or is in the same iteration as the target in loop i

  do i = 1,6
    do j = 1,5
      A(i,j) = A(i-1,j-1)+1
    enddo
  enddo

  Distance vector = (1,1)
  Direction vector = (<,<)

Distance Vectors: Legality
 Definition
  – A dependence vector, v, is lexicographically nonnegative when its leftmost nonzero entry is positive or all elements of v are zero
    Yes: (0,0,0), (0,1), (0,2,-2)
    No: (-1), (0,-2), (0,-1,1)
  – A dependence vector is legal when it is lexicographically nonnegative (assuming that indices increase as we iterate)
 Why are lexicographically negative distance vectors illegal? (They would mean the sink iteration executes before the source.)
 What are legal direction vectors?

Loop-Carried Dependences
 Definition
  – A dependence D = (d1,...,dn) is carried at loop level i if di is the first nonzero element of D
 Example

  do i = 1,5
    do j = 1,5
      A(i,j) = B(i-1,j)+1
      B(i,j) = A(i,j-1) * 2
    enddo
  enddo

  – Distance vectors: (1,0) for the accesses to B, (0,1) for the accesses to A
  – Loop-carried dependences: the i loop carries the dependence due to B; the j loop carries the dependence due to A

Parallelization
 Idea
  – Each iteration of a loop may be executed in parallel if it carries no dependences

  do i = 1,5
    do j = 1,5
      A(i,j) = B(i-1,j-1)+1
      B(i,j) = A(i,j-1) * 2
    enddo
  enddo

  Parallelize the i loop?

Parallelization (cont.)
 Idea
  – Each iteration of a loop may be executed in parallel if it carries no dependences

  do i = 1,5
    do j = 1,5
      A(i,j) = B(i-1,j-1)+1
      B(i,j) = A(i,j-1) * 2
    enddo
  enddo

  Parallelize the j loop?

Scalar Expansion  Problem  Loop-carried dependences inhibit parallelism  Scalar references result in loop-carried dependences  Can this loop be parallelized? ?  What kind of dependences are these? ? do i = 1,6 t = A(i) + B(i) C(i) = t + 1/t enddo

Scalar Expansion  Idea  Eliminate false dependences by introducing extra storage  Example  Can this loop be parallelized?  Disadvantages? do i = 1,6 T(i) = A(i) + B(i) C(i) = T(i) + 1/ T(i) enddo

Scalar Expansion Details
 Restrictions
  – The loop must be a countable loop, i.e., the loop trip count must be independent of the body of the loop
  – There cannot be loop-carried flow dependences due to the scalar
  – The expanded scalar must have no upward-exposed uses in the loop, as in:

  do i = 1,6
    print(t)
    t = A(i) + B(i)
    C(i) = t + 1/t
  enddo

 Nested loops may require much more storage
 When the scalar is live after the loop, we must move the correct array value into the scalar
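A minimal sketch of that last point (my illustration, not from the slides): if t is read after the loop, the expanded version must copy the final element back into the scalar.

  do i = 1,6
    T(i) = A(i) + B(i)
    C(i) = T(i) + 1/T(i)
  enddo
  t = T(6)   ! restore the value t would have held after the last iteration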

Example Revisited
 Sample code

  do j = 1,6
    do i = 1,5
      A(j,i) = A(j,i)+1
    enddo
  enddo

  do i = 1,5
    do j = 1,6
      A(j,i) = A(j,i)+1
    enddo
  enddo

 Why is this legal?
  – There are no loop-carried dependences, so we can arbitrarily change the order of iteration execution

Dependence Testing
 Consider the following code...

  do i = 1,5
    A(3*i+2) = A(2*i+1)+1
  enddo

 Question
  – How do we determine whether one array reference depends on another across iterations of an iteration space?
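Worked answer for this example (mine, not spelled out on the slide): a dependence requires 3*i1 + 2 = 2*i2 + 1 for some iterations i1, i2 in 1..5, i.e. 3*i1 – 2*i2 = -1. This has the in-bounds solution i1 = 1, i2 = 2: A(5) is written in iteration 1 and read in iteration 2, so there is a loop-carried flow dependence.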

Dependence Testing in General
 General code

  do i1 = l1,h1
    ...
    do in = ln,hn
      A(f(i1,...,in)) = ... A(g(i1,...,in)) ...
    enddo
    ...
  enddo

 There exists a dependence between iterations I = (i1,...,in) and J = (j1,...,jn) when
  – f(I) = g(J)
  – (l1,...,ln) <= I, J <= (h1,...,hn)

Multi-dimension Arrays
 Integer linear programming

  int A[1..100]
  ... A[2*i, 2*j] ...
  ... A[2*i+3, 3*j-3] ...
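Worked answer (my reasoning, not on the slide): for these two references to touch the same element, the first subscripts must match: 2*i = 2*i' + 3. The left side is always even and the right side always odd, so there is no integer solution and the two references are independent, regardless of the second subscript.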

Algorithms for Solving the Dependence Problem
 Heuristics
  – GCD test (Banerjee 76, Towle 76): determines whether an integer solution is possible; no bounds checking
  – Banerjee test (Banerjee 79): checks real bounds
  – I-Test (Kong et al. 90): integer solution within real bounds
  – Lambda test (Li et al. 90): all dimensions simultaneously
  – Delta test (Goff et al. 91): pattern matching for efficiency
  – Power test (Wolfe et al. 92): extended GCD and Fourier-Motzkin combination
 Use some form of Fourier-Motzkin elimination for integers
  – Parametric Integer Programming (Feautrier 91)
  – Omega test (Pugh 92)

Dependence Testing: Simple Case
 Sample code

  do i = l,h
    A(a*i+c1) = ... A(a*i+c2) ...
  enddo

 Dependence?
  – a*i1 + c1 = a*i2 + c2, or
  – a*i1 – a*i2 = c2 – c1
  – A solution exists if a divides c2 – c1

Example  Code do i = l,h A(2*i+2) = A(2*i-2)+1 enddo  Dependence? 2*i1 – 2*i2 = -2 – 2 = -4 (yes, 2 divides -4)  Kind of dependence?  Anti? i2 + d = i1 ⇒ d = -2  Flow? i1 + d = i2 ⇒ d = 2 i1i2

GCD Test
 Idea
  – Generalize the test to linear functions of the iterators
 Code

  do i = li,hi
    do j = lj,hj
      A(a1*i + a2*j + a0) = ...
      ... = A(b1*i + b2*j + b0) ...
    enddo
  enddo

 Again
  – a1*i1 – b1*i2 + a2*j1 – b2*j2 = b0 – a0
  – A solution exists if gcd(a1,a2,b1,b2) divides b0 – a0

Example  Code do i = li,hi do j = lj,hj A(4*i + 2*j + 1) =... A(6*i + 2*j + 4)... enddoenddo

Till Now
 Improve performance by...
  – improving data locality
  – parallelizing the computation
 Data Dependences
  – iteration space
  – distance vectors and direction vectors
  – loop-carried dependences
 Transformation legality
  – must respect data dependences
  – scalar expansion as a technique to remove anti- and output dependences
 Data Dependence Testing
  – general formulation of the problem
  – GCD test

Affine Array Indexes
 An array access in a loop is affine if
  – The bounds of the loop are expressed as affine expressions of the surrounding loop variables and symbolic constants
  – The index for each dimension of the array is also an affine expression of the surrounding loop variables and symbolic constants
 Examples
  – X[i-1], X[i, j+1], X[1, i, 2*i+j] : affine
  – X[i*j] : not an affine array access

Nonaffine Accesses in Practice  Sparse matrices  X[Y[i]]

Loop Permutation
 Idea
  – Swap the order of two loops to increase parallelism, to improve spatial locality, or to enable other transformations
  – Also known as loop interchange

  do i = 1,5
    do j = 1,5
      x = A(2,j) + 1     ! strides through a row of A
    enddo
  enddo

  do j = 1,5
    do i = 1,5
      x = A(2,j) + 1     ! invariant w.r.t. the inner loop
    enddo
  enddo

More Examples
 A(i,j)

  do i = 1,5
    do j = 1,5
      x = A(i,j) + 1     ! stride-n access
    enddo
  enddo

  do j = 1,5
    do i = 1,5
      x = A(i,j) + 1     ! stride-1 access
    enddo
  enddo

Dependency Problem
 What is the distance or direction vector of the dependences?
  – May require an exponential number of calls to a dependence-testing algorithm that only returns yes/no
  – Input: IP problem
  – Output: distance or direction vector for dependences
  – Example outputs: (1,0), (<,>), (>,=), (0,3)
    Which one of the above dependence vectors is not legal?
 What is the dependence relation?
  – Mapping from one iteration space to another
  – Input: Presburger formula (i.e., affine constraints, existential and universal quantifiers, logical operators)
  – Output: simplified Presburger formula representing the dependence relation
  – Example input: { [i,j] → [i',j'] | 1 <= i,j,i',j' <= 10 & i=i'-1 & j=j' & i<i' & j<j' }
  – Example output: { [i,j] → [i+1,j] | 1 <= i,j <= 10 }

Legality of Loop Interchange
 Case analysis of the direction vectors
  – (=,=) The dependence is loop independent, so it is unaffected by interchange
  – (=,<) The dependence is carried by the j loop. After interchange the dependence will be (<,=), so it will still be carried by the j loop, and the dependence relations do not change
  – (<,=) The dependence is carried by the i loop. After interchange the dependence will be (=,<), so it will still be carried by the i loop, and the dependence relations do not change

Legality of Loop Interchange (cont.)
 More cases
  – (<,<) The dependence distance is positive in both dimensions. After interchange it will still be positive in both dimensions, so the dependence relations do not change
  – (<,>) The dependence is carried by the outer loop. After interchange the dependence will be (>,<), which changes the dependences and results in an illegal direction vector, so interchange is illegal
  – (>,*), (=,>) Such direction vectors are not possible for the original loop

Loop Interchange Example
 Consider the (<,>) case

Frameworks for Loop Transformations
 Unimodular Loop Transformations
  – [Banerjee 90], [Wolf & Lam 91]
  – For loop permutation, loop reversal, and loop skewing
  – Idea: T i = i', where T is a unimodular matrix and i and i' are iteration vectors
  – A transformation is legal if the transformed dependence vectors remain lexicographically positive
  – Limitations: only perfectly nested loops; all statements are transformed the same way

Revisit the Legality of Loop Interchange
 Interchange Matrix
 (=,=)
 (=,<)
 (<,>)
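As a concrete illustration (mine, not from the slide): interchange corresponds to the permutation matrix

  T = ( 0 1 )
      ( 1 0 )

Applying T to a distance vector swaps its components: T applied to (1,2) gives (2,1), still lexicographically positive, so interchange is legal for that dependence; T applied to (1,-1), the (<,>) case, gives (-1,1), which is lexicographically negative, so interchange is illegal there.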

Loop Reversal  Idea  Change the direction of loop iteration (i.e., From low-to-high indices to high-to-low indices or vice versa)  Benefits  Improved cache performance  Enables other transformations (coming soon)  Example

Loop Reversal and Distance Vectors
 Impact
  – Reversal of loop i negates the i-th entry of all distance vectors associated with the loop
  – What about direction vectors?
 When is reversal legal?
  – When the loop being reversed does not carry a dependence (i.e., when the transformed distance vectors remain legal)
 Example

  do i = 1,5
    do j = 1,6
      A(i,j) = A(i-1,j-1)+1
    enddo
  enddo

  Distance Vector: (1,1)
  Transformed Distance Vector (reversing j): (1,-1) — legal? (Yes: the leading entry is still positive, so the j loop carried no dependence.)

Transforming the Dependences and Array Accesses

Loop Reversal Example  Legality  Loop reversal will change the direction of the dependence relation  Is the following legal?

Loop Skewing
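The body of this slide did not survive extraction. A standard skewing sketch (my example, not necessarily the slide's): skewing loop j by loop i with factor f replaces j with jj = j + f*i, shifting the inner bounds and turning a distance vector (di, dj) into (di, dj + f*di).

  ! original: distance vector (1,-1)
  do i = 1,N
    do j = 1,N
      A(i,j) = A(i-1,j+1) + 1
    enddo
  enddo

  ! skewed by f = 1: distance vector becomes (1,0)
  do i = 1,N
    do jj = 1+i, N+i
      A(i,jj-i) = A(i-1,jj-i+1) + 1
    enddo
  enddo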

Transforming the Loop Bounds

Loop Fusion
 Idea
  – Combine multiple loop nests into one
 Example (the slide's example was lost; see the sketch below)
 Pros
  – May improve data locality
  – Reduces loop overhead
  – Enables array contraction (the opposite of scalar expansion)
  – May enable better instruction scheduling
 Cons
  – May hurt data locality
  – May hurt icache performance
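A minimal fusion sketch (my example, standing in for the slide's lost one):

  do i = 1,N
    A(i) = B(i) + 1
  enddo
  do i = 1,N
    C(i) = A(i) * 2
  enddo

  ! fused: each A(i) is reused while still in cache or a register
  do i = 1,N
    A(i) = B(i) + 1
    C(i) = A(i) * 2
  enddo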

Legality of Loop Fusion
 Basic Conditions
  – Both loops must have the same structure: same loop depth, same loop bounds, same iteration directions
  – Dependences must be preserved (e.g., flow dependences must not become anti-dependences)

Loop Fusion Example  What are the dependences?  Is there some transformation that will enable fusion of these loops?

Loop Fusion Example (cont)
 Loop reversal is legal for the original loops
  – It does not change the direction of any dependence in the original code
 Reversing the direction in the fused loop: s3 δa s2 will become s2 δf s3

Loop Distribution  Idea  Split a loop nest into multiple loop nests (the inverse of fusion)  Motivation?  Produces multiple (potentially) less constrained loops  May improve locality  Enable other transformations, such as interchange

Legality  Loop distribution is legal when the loop body contains no cycles in the dependence graph

Example  Reverse of our previous example

Example  If there are no cycles, we can reorder the loops with a topological sort

Loop Unrolling
 Motivation
  – Reduces loop overhead
  – Improves the effectiveness of other transformations: code scheduling, CSE
 The Transformation
  – Make n copies of the loop body: n is the unrolling factor
  – Adjust loop bounds accordingly
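A sketch with unroll factor 2 (my example; assumes N is even for simplicity):

  do i = 1, N, 2
    A(i)   = B(i) + 1
    A(i+1) = B(i+1) + 1
  enddo

In general a cleanup loop handles the leftover iterations when N is not a multiple of the unroll factor.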

Loop Balance  Problem  We’d like to produce loops with the right balance of memory operations and floating point operations  The ideal balance is machine-dependent e.g. How many load-store units are connected to the L1 cache? e.g. How many functional units are provided?

Unroll and Jam  Idea  Restructure loops so that loaded values are used many times per iteration  Unroll and Jam  Unroll the outer loop some number of times  Fuse (Jam) the resulting inner loops

Example
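The slide's example was lost in extraction; a standard unroll-and-jam sketch (mine; assumes N is even):

  ! original
  do i = 1, N
    do j = 1, N
      A(i) = A(i) + B(j)
    enddo
  enddo

  ! outer loop unrolled by 2, inner loops jammed:
  ! each load of B(j) now feeds two additions
  do i = 1, N, 2
    do j = 1, N
      A(i)   = A(i)   + B(j)
      A(i+1) = A(i+1) + B(j)
    enddo
  enddo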

Unroll and Jam IS Tiling

Discussion
 The problem is hard
  – Just finding a legal unimodular transformation is exponential in the number of loops
 Heuristic
  – Perform reuse analysis to determine the innermost tile (i.e., the localized vector space)
  – For the localized vector space, break the problem into all possible tiling combinations
  – Apply the SRP (Skew, Reversal, Permutation) algorithm in an attempt to make loops fully permutable
    Definitely works when dependences are lexicographically positive distance vectors
    O(n^2 * d), where n is the loop nest depth and d is the number of dependence vectors