Optimizing Compilers for Modern Architectures Allen and Kennedy, Chapter 13 Compiling Array Assignments.

Optimizing Compilers for Modern Architectures Allen and Kennedy, Chapter 13 Compiling Array Assignments

Optimizing Compilers for Modern Architectures Fortran 90
Fortran 90: successor to Fortran 77. Slow to gain acceptance:
— Need better/smarter compiler techniques to achieve the same level of performance as Fortran 77 compilers
This chapter focuses on a single new feature, the array assignment statement:
   A(1:100) = 2.0
— Intended to provide a direct mechanism for specifying parallel/vector execution
This statement must be implemented on whatever hardware is available. On a uniprocessor, it must be converted to a scalar loop: scalarization.
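For instance, a minimal sketch of what scalarization produces for the assignment above (assuming a unit-stride specifier; the loop index I is introduced by the compiler):
   ! Fortran 90 array assignment
   A(1:100) = 2.0
   ! one possible scalarized form on a uniprocessor
   DO I = 1, 100
      A(I) = 2.0
   ENDDO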

Optimizing Compilers for Modern Architectures Fortran 90
The range of a vector operation in Fortran 90 is denoted by a triplet:
   A(1:100:2) = B(2:51)
The semantics of Fortran 90 require that, for vector statements, all inputs to the statement are fetched before any results are stored.
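As an illustration (not from the slide): element k of a triplet (L:U:D) denotes index L + (k-1)*D, so the assignment above pairs the 50 odd-indexed elements of A with B(2) through B(51). A hypothetical element-by-element reading:
   DO K = 1, 50
      A(2*K - 1) = B(K + 1)   ! A(1)=B(2), A(3)=B(3), ..., A(99)=B(51)
   ENDDO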

Optimizing Compilers for Modern Architectures Outline
Simple scalarization
Safe scalarization
Techniques to improve on safe scalarization
— Loop reversal
— Input prefetching
— Loop splitting
Multidimensional scalarization
A framework for analyzing multidimensional scalarization

Optimizing Compilers for Modern Architectures Scalarization
Replace each array assignment by a corresponding DO loop. Is it really that easy?
Two key issues:
— Wish to avoid generating large array temporaries
— Wish to optimize loops to exhibit good memory-hierarchy performance

Optimizing Compilers for Modern Architectures Simple Scalarization
Consider the vector statement:
   A(1:200) = 2.0 * A(1:200)
A scalar implementation:
S1    DO I = 1, 200
S2       A(I) = 2.0 * A(I)
      ENDDO
However, some statements cause problems:
   A(2:201) = 2.0 * A(1:200)
If we naively scalarize:
   DO I = 1, 200
      A(I+1) = 2.0 * A(I)
   ENDDO
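A small worked trace (not on the original slide) makes the fault concrete. Under vector semantics every A(I+1) should receive twice the old value of A(I). The naive loop instead computes A(2) = 2*A(1), then A(3) = 2*A(2) = 4*A(1) (the old A(1)), then A(4) = 8*A(1), and so on: every element after A(2) becomes a power of two times A(1) rather than twice its original left neighbor.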

Optimizing Compilers for Modern Architectures Simple Scalarization
Meaning of statements changed by the scalarization process: scalarization faults
Naive algorithm which ignores such scalarization faults:
   procedure SimpleScalarize(S)
      let V0 = (L0 : U0 : D0) be the vector iteration specifier on the left side of S;
      // Generate the scalarizing loop
      let I be a new loop index variable;
      generate the statement DO I = L0, U0, D0;
      for each vector specifier V = (L : U : D) in S do
         replace V with (I + L - L0);
      generate an ENDDO statement;
   end SimpleScalarize

Optimizing Compilers for Modern Architectures Scalarization Faults
Why do scalarization faults occur?
Vector operation semantics: all values from the RHS of the assignment must be fetched before storing into the result.
If a scalar operation stores into a location fetched by a later operation, we get a scalarization fault.
Principle 13.1: A vector assignment generates a scalarization fault if and only if the scalarized loop carries a true dependence.
These dependences are known as scalarization dependences.
To preserve correctness, the compiler should never produce a scalarization dependence.

Optimizing Compilers for Modern Architectures Safe Scalarization
Naive algorithm for safe scalarization: use temporary storage to make sure scalarization dependences are not created.
Consider:
   A(2:201) = 2.0 * A(1:200)
This can be split up into:
   T(1:200) = 2.0 * A(1:200)
   A(2:201) = T(1:200)
Then scalarize using SimpleScalarize:
   DO I = 1, 200
      T(I) = 2.0 * A(I)
   ENDDO
   DO I = 2, 201
      A(I) = T(I-1)
   ENDDO

Optimizing Compilers for Modern Architectures Safe Scalarization
Procedure SafeScalarize implements this method of scalarization (a sketch follows below).
Good news:
— Scalarization is always possible by using temporaries
Bad news:
— Substantial increase in memory use due to temporaries
— More memory operations per array element
We shall look at a number of techniques to reduce the effects of these disadvantages.
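The slides do not reproduce SafeScalarize itself; a minimal sketch of the idea, in the same pseudocode style (the actual procedure in Allen and Kennedy handles general specifiers), might look like:
   procedure SafeScalarize(S)
      // sketch only: T and "extent" are placeholders for a compiler-generated
      // temporary and the extent of the left-hand-side specifier
      allocate a temporary array T with the shape of the left-hand side of S;
      split S into:   T(1:extent) = <right-hand side of S>
                      <left-hand side of S> = T(1:extent)
      apply SimpleScalarize to each of the two statements;
      // neither statement fetches what it stores, so no scalarization dependence arises
   end SafeScalarize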

Optimizing Compilers for Modern Architectures Loop Reversal
Consider:
   A(2:256) = A(1:255)
SimpleScalarize will produce a scalarization fault.
Solution: loop reversal
   DO I = 256, 2, -1
      A(I) = A(I-1)
   ENDDO

Optimizing Compilers for Modern Architectures Loop Reversal
When can we use loop reversal?
— Loop reversal maps true dependences into antidependences
— But it also maps antidependences into true dependences
   A(2:257) = ( A(1:256) + A(3:258) ) / 2.0
After scalarization:
   DO I = 2, 257
      A(I) = ( A(I-1) + A(I+1) ) / 2.0
   ENDDO
Loop reversal gets us:
   DO I = 257, 2, -1
      A(I) = ( A(I-1) + A(I+1) ) / 2.0
   ENDDO
Thus, we cannot use loop reversal in the presence of antidependences.
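A quick check (not spelled out on the slide) shows why the reversed loop is wrong: at I = 256 it reads A(257), but A(257) was already overwritten in the preceding iteration (I = 257), so the antidependence on A(I+1) has become a true dependence carried by the reversed loop.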

Optimizing Compilers for Modern Architectures Input Prefetching
   A(2:257) = ( A(1:256) + A(3:258) ) / 2.0
This causes a scalarization fault when naively scalarized to:
   DO I = 2, 257
      A(I) = ( A(I-1) + A(I+1) ) / 2.0
   ENDDO
Problem: the previous iteration stores into A(I-1), the first input of the current iteration.
Input prefetching: use scalar temporaries to hold elements of the input and output arrays.

Optimizing Compilers for Modern Architectures Input Prefetching
A first cut at using temporaries:
   DO I = 2, 257
      T1 = A(I-1)
      T2 = ( T1 + A(I+1) ) / 2.0
      A(I) = T2
   ENDDO
T1 holds an element of the input array, T2 holds an element of the output array.
But this faces the same problem. We can correct it by moving the assignment to T1 into the previous iteration...

Optimizing Compilers for Modern Architectures Input Prefetching
   T1 = A(1)
   DO I = 2, 256
      T2 = ( T1 + A(I+1) ) / 2.0
      T1 = A(I)
      A(I) = T2
   ENDDO
   T2 = ( T1 + A(258) ) / 2.0
   A(257) = T2
Note: we are using scalar replacement, but the motivation for doing so is different from that in Chapter 8.

Optimizing Compilers for Modern Architectures Input Prefetching
As seen in Chapter 8, we need as many temporaries as the dependence threshold plus one. Example:
   DO I = 1, 257
      A(I+2) = A(I)
   ENDDO
This can be changed to:
   T1 = A(1)
   T2 = A(2)
   DO I = 1, 255
      T3 = T1
      T1 = T2
      T2 = A(I+2)
      A(I+2) = T3
   ENDDO
   T3 = T1
   T1 = T2
   A(258) = T3
   T3 = T1
   A(259) = T3

Optimizing Compilers for Modern Architectures Input Prefetching
We can also unroll the loop and eliminate register-to-register copies (see the sketch below).
Principle 13.2: Any scalarization dependence with a threshold known at compile time can be corrected by input prefetching.
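One way to realize that unrolling (a sketch not on the original slides; it assumes the threshold-2 example above, whose 255-iteration trip count happens to be divisible by 3, so no cleanup loop is shown): unrolling the prefetched loop by threshold + 1 = 3 lets the three temporaries rotate roles, so the copies T3 = T1, T1 = T2 disappear.
   T1 = A(1)
   T2 = A(2)
   DO I = 1, 253, 3          ! 85 unrolled iterations cover I = 1..255
      T3 = A(I+2)            ! fetch old A(I+2) before it is overwritten
      A(I+2) = T1            ! A(I+2) = old A(I)
      T1 = A(I+3)
      A(I+3) = T2            ! A(I+3) = old A(I+1)
      T2 = A(I+4)
      A(I+4) = T3            ! A(I+4) = old A(I+2)
   ENDDO
   A(258) = T1               ! remaining two iterations, as before
   A(259) = T2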

Optimizing Compilers for Modern Architectures Input Prefetching
Sometimes, even when a scalarization dependence does not have a constant threshold, input prefetching can be used effectively:
   A(1:N) = A(1:N) / A(1)
which can be naively scalarized as:
   DO i = 1, N
      A(i) = A(i) / A(1)
   ENDDO
There is a true dependence from the first iteration to every other iteration, and an antidependence from the first iteration to itself.
Via input prefetching, we get:
   tA1 = A(1)
   DO i = 1, N
      A(i) = A(i) / tA1
   ENDDO

Optimizing Compilers for Modern Architectures Loop Splitting
Problem with using input prefetching with thresholds > 1: temporaries must be saved for each iteration of the loop up to the threshold.
We can potentially use loop splitting to solve this problem:
   A(3:6) = ( A(1:4) + A(5:8) ) / 2.0
Scalarization loop:
   DO I = 3, 6
      A(I) = ( A(I-2) + A(I+2) ) / 2.0
   ENDDO
True dependence and antidependence, each with a threshold of 2.
Split up into two independent loops which do not interact with each other...

Optimizing Compilers for Modern Architectures Loop Splitting
   DO I = 3, 6
      A(I) = ( A(I-2) + A(I+2) ) / 2.0
   ENDDO
Both the true dependence and the antidependence have a threshold of 2. With loop splitting this becomes:
   DO I = 3, 5, 2
      A(I) = ( A(I-2) + A(I+2) ) / 2.0
   ENDDO
   DO I = 4, 6, 2
      A(I) = ( A(I-2) + A(I+2) ) / 2.0
   ENDDO
Note that the threshold becomes 1 for each loop. This could have produced incorrect results if the threshold of the antidependence had not been divisible by 2.

Optimizing Compilers for Modern Architectures Loop Splitting
We can write the splitting as a nested pair of loops:
   DO i1 = 3, 4
      DO i2 = i1, 6, 2
         A(i2) = ( A(i2-2) + A(i2+2) ) / 2.0
      ENDDO
   ENDDO
The inner loop carries a scalarization dependence with threshold 1 and the outer loop carries no dependence. We can apply input prefetching:
   DO i1 = 3, 4
      T1 = A(i1-2)
      DO i2 = i1, 6, 2
         T2 = ( T1 + A(i2+2) ) / 2.0
         T1 = A(i2)
         A(i2) = T2
      ENDDO
   ENDDO

Optimizing Compilers for Modern Architectures Loop Splitting Principle 13.3: Any scalarization loop in which all true dependences have the same constant threshold T and all antidependences have a threshold that is divisible by T can be transformed, using input prefetching and loop splitting, so that all scalarization dependences are eliminated.
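A general form of the split (a sketch, not from the slides; L, U, and T are placeholders for the scalarization loop bounds and the common true-dependence threshold) follows the nested pattern shown earlier: the outer loop enumerates the T interleaved iteration classes and the inner loop steps by T, so the true dependences become threshold-1 dependences in the inner loops, which input prefetching can then remove.
   DO i1 = L, L + T - 1
      DO i2 = i1, U, T
         ! original scalarized body, with its index replaced by i2
      ENDDO
   ENDDO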

Optimizing Compilers for Modern Architectures Scalarization Algorithm
Revised scalarization algorithm:
   procedure FullScalarize
      for each vector statement S do begin
         compute the dependences of S on itself as though S had been scalarized;
         if S has no scalarization dependences upon itself then
            SimpleScalarize(S);
         else if S has scalarization dependences, but no self antidependences then begin
            SimpleScalarize(S);
            reverse the scalarization loop;
         end

Optimizing Compilers for Modern Architectures Scalarization Algorithm
         else if all scalarization dependences have a threshold of 1 then begin
            SimpleScalarize(S);
            InputPrefetch(S);
         end
         else if all scalarization dependences for S have the same constant threshold T
                 and all antidependences have thresholds that are divisible by T then
            SplitLoop(S);
         else if all antidependences for S have the same constant threshold T
                 and all true dependences have thresholds that are divisible by T then begin
            reverse the loop;
            SplitLoop(S);
         end
         else SafeScalarize(S, SL);
      end
   end FullScalarize

Optimizing Compilers for Modern Architectures Multidimensional Scalarization
Vector statements in Fortran 90 in more than one dimension:
   A(1:100, 1:100) = B(1:100, 1, 1:100)
corresponds to:
   DO J = 1, 100
      A(1:100, J) = B(1:100, 1, J)
   ENDDO
Scalarization in multiple dimensions:
   A(1:100, 1:100) = 2.0 * A(1:100, 1:100)
Obvious strategy: convert each vector iterator into a loop:
   DO J = 1, 100
      DO I = 1, 100
         A(I,J) = 2.0 * A(I,J)
      ENDDO
   ENDDO

Optimizing Compilers for Modern Architectures Multidimensional Scalarization
What should the order of the loops be after scalarization?
— Familiar question: we dealt with this issue in loop selection/interchange in Chapter 5
Profitability of a particular configuration depends on the target architecture.
— For simplicity, we shall assume shorter strides through memory are better
— Thus, since Fortran stores arrays in column-major order, the optimal choice for the innermost loop is the leftmost vector iterator

Optimizing Compilers for Modern Architectures Multidimensional Scalarization
Extending the previous results to multiple dimensions:
— Each vector iterator is scalarized separately, with the leftmost vector iterator in the innermost loop and the remaining iterators, from left to right, in successively outer positions
Once the ordering is available:
1. Test to see if the loop carries a scalarization dependence. If not, then proceed to the next loop.
2. If the scalarization loop carries only true dependences, reverse the loop and proceed to the next loop.
3. Apply input prefetching, with loop splitting where appropriate, to eliminate dependences to which it applies. Observe, however, that in outer loops, prefetching is done for a single submatrix (the remaining dimensions).
4. Otherwise, the loop carries a scalarization fault that requires temporary storage. Generate a scalarization that utilizes temporary storage and terminate the scalarization test for this loop, since temporary storage will eliminate all scalarization faults.

Optimizing Compilers for Modern Architectures Outer Loop Prefetching
   A(1:N, 1:N) = ( A(0:N-1, 2:N+1) + A(2:N+1, 0:N-1) ) / 2.0
If we try to scalarize this (keeping the column iterator in the innermost loop), we get a true scalarization dependence (<, >) involving the second input and an antidependence (>, <) involving the first input.
Cannot use loop reversal...

Optimizing Compilers for Modern Architectures Outer Loop Prefetching
   A(1:N, 1:N) = ( A(0:N-1, 2:N+1) + A(2:N+1, 0:N-1) ) / 2.0
We can use input prefetching on the outer loop. The temporaries will be arrays:
   T0(1:N) = A(2:N+1, 0)
   DO j = 1, N-1
      T1(1:N) = ( A(0:N-1, j+1) + T0(1:N) ) / 2.0
      T0(1:N) = A(2:N+1, j)
      A(1:N, j) = T1(1:N)
   ENDDO
   T1(1:N) = ( A(0:N-1, N+1) + T0(1:N) ) / 2.0
   A(1:N, N) = T1(1:N)
Total temporary space required is just two length-N vectors, better than the storage required for a full copy of the result matrix.

Optimizing Compilers for Modern Architectures Loop Interchange
Sometimes there is a tradeoff between scalarization and optimal memory-hierarchy usage:
   A(2:100, 3:101) = A(3:101, 1:201:2)
If we scalarize this using the prescribed order:
   DO I = 3, 101
      DO J = 2, 100
         A(J,I) = A(J+1, 2*I-5)
      ENDDO
   ENDDO
There are dependences (<, >) (between iterations I = 3 and 4) and (>, >) (between iterations I = 6 and 7).
Cannot use loop reversal or input prefetching. Can use temporaries.

Optimizing Compilers for Modern Architectures Loop Interchange
However, we can use loop interchange to get:
   DO J = 2, 100
      DO I = 3, 101
         A(J,I) = A(J+1, 2*I-5)
      ENDDO
   ENDDO
Not optimal memory-hierarchy usage, but a reduction in temporary storage.
Loop interchange is useful for reducing the size of temporaries; it can also eliminate scalarization dependences.

Optimizing Compilers for Modern Architectures General Multidimensional Scalarization
Goal: to scalarize a single statement which has m vector dimensions
— Given an ideal order of scalarization (l1, l2, ..., lm)
— Let (d1, d2, ..., dn) be the direction vectors for all true and antidependences of the statement upon itself
— The scalarization matrix is an n x m matrix of these direction vectors
For instance:
   A(1:N, 1:N, 1:N) = A(0:N-1, 1:N, 2:N+1) + A(1:N, 2:N+1, 0:N-1)
has the scalarization matrix below (one row per input operand; columns list the candidate loops in the ideal order, outermost first, i.e., dimension 3, then 2, then 1):
   >  =  <
   <  >  =
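As a worked check of the first row (my reading of the entries, taking each direction from the store to the fetch of the same location, and writing I, J, K for the iterators of the three dimensions): the statement stores A(I,J,K) and its first input fetches A(I-1, J, K+1), so the location stored at iteration (I, J, K) is fetched at (I+1, J, K-1), giving < in dimension 1, = in dimension 2, and > in dimension 3, i.e., the row (>, =, <) in outermost-first column order. The second input, fetched at (I, J-1, K+1), gives the row (<, >, =) in the same way.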

Optimizing Compilers for Modern Architectures General Multidimensional Scalarization
If we examine any column of the direction matrix, we can immediately see whether the corresponding loop can be safely scalarized as the outermost loop of the nest:
— If all entries of the column are = or >, it can be safely scalarized as the outermost loop without loop reversal.
— If all entries are = or <, it can be safely scalarized with loop reversal.
— If it contains a mixture of < and >, it cannot be scalarized by simple means.

Optimizing Compilers for Modern Architectures General Multidimensional Scalarization
Once a loop has been selected for scalarization, any dependence whose direction vector does not contain an = in the position corresponding to the selected loop is carried by that loop and may be eliminated from further consideration.
In our example, if we move the second column to the outside, we get:
   =  >  <
   >  <  =
That column contains only = and > entries, so its loop can be scalarized as the outermost loop without reversal; the dependence in the second row is carried by it and drops out. Scalarization in this way will reduce the matrix to:
   >  <

Optimizing Compilers for Modern Architectures A Complete Scalarization Algorithm
   procedure CompleteScalarize(S, loop_list)
      let M be the scalarization direction matrix resulting from scalarizing S according to loop_list;
      while there are more loops to be scalarized do begin
         let l be the first loop in loop_list that can be simply scalarized, with or without
            loop reversal (determined by examining the columns of M from left to right);
         if there is no such l then begin
            let l be the first loop on loop_list;
            section l by input prefetching;
            if the previous step fails then
               section S using the naive temporary method and exit;
         end

Optimizing Compilers for Modern Architectures A Complete Scalarization Algorithm
         else begin // make l the outermost loop
            section l directly or with loop reversal, depending on the entries in the column of M
               corresponding to l (if l is the last loop, use the hardware section length);
            remove l from loop_list;
            let M' be M with the column corresponding to l, and the rows corresponding to
               non-"=" entries in that column, eliminated;
            M := M';
         end
      end // while
   end CompleteScalarize
Time complexity = O(m²n), where
— m = number of loops
— n = number of dependences

Optimizing Compilers for Modern Architectures A Complete Scalarization Algorithm
Correctness follows from the definition of the scalarization matrix and the method for removing dependences from the matrix.
For a given statement S and loop list, CompleteScalarize produces a correct scalarization with the following properties:
1. Input prefetching is applied to the innermost loop possible.
2. The order of the scalarization loops is the closest possible to the order specified on input, among scalarizations with property (1).

Optimizing Compilers for Modern Architectures Scalarization Example
   DO J = 2, N-1
      A(2:N-1,J) = ( A(1:N-2,J) + A(3:N,J) +        &
                     A(2:N-1,J-1) + A(2:N-1,J+1) ) / 4.
   ENDDO
The inner vector statement has a loop-carried true dependence and an antidependence.
A naive compiler could generate:
   DO J = 2, N-1
      DO i = 2, N-1
         T(i-1) = ( A(i-1,J) + A(i+1,J) +           &
                    A(i,J-1) + A(i,J+1) ) / 4.
      ENDDO
      DO i = 2, N-1
         A(i,J) = T(i-1)
      ENDDO
   ENDDO
This adds 2 × (N-2)² memory accesses for the array T (each of its (N-2)² elements is written once and read once).

Optimizing Compilers for Modern Architectures Scalarization Example
However, we can use input prefetching to get:
   DO J = 2, N-1
      tA0 = A(1, J)
      DO i = 2, N-2
         tA1 = ( tA0 + A(i+1,J) + A(i,J-1) + A(i,J+1) ) / 4.
         tA0 = A(i, J)
         A(i,J) = tA1
      ENDDO
      tA1 = ( tA0 + A(N,J) + A(N-1,J-1) + A(N-1,J+1) ) / 4.
      A(N-1,J) = tA1
   ENDDO
If the temporaries are allocated to registers, this performs no more memory accesses than the original Fortran 90 program.

Optimizing Compilers for Modern Architectures Post Scalarization Issues
Issues due to scalarization:
— Generates many individual loops
— These loops carry no dependences, so reuse of quantities in registers is not common
Solution: use loop interchange, loop fusion, unroll-and-jam, and scalar replacement (see the sketch below).
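For example (a hedged sketch with made-up statements, not from the slides), two adjacent array assignments scalarize into separate loops; fusing them and applying scalar replacement lets the value of A(I) be reused from a register instead of being stored and reloaded:
   ! two scalarized loops
   DO I = 1, N
      A(I) = B(I) + C(I)
   ENDDO
   DO I = 1, N
      D(I) = 2.0 * A(I)
   ENDDO
   ! fused, with scalar replacement of A(I)
   DO I = 1, N
      T = B(I) + C(I)
      A(I) = T
      D(I) = 2.0 * T
   ENDDO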