Optimizing Compilers for Modern Architectures: Compiler Improvement of Register Usage (Allen and Kennedy, Chapter 8, through Section 8.4)



Overview
- Improving memory hierarchy performance by compiler transformations:
  - Scalar replacement
  - Unroll-and-jam
- Saving memory loads and stores
- Making good use of the processor registers

Motivating Example

  DO I = 1, N
    DO J = 1, M
      A(I) = A(I) + B(J)
    ENDDO
  ENDDO

A(I) can be left in a register throughout the inner loop, but coloring-based register allocation fails to recognize this.

  DO I = 1, N
    T = A(I)
    DO J = 1, M
      T = T + B(J)
    ENDDO
    A(I) = T
  ENDDO

All loads and stores of A in the inner loop have been saved, and T has a high chance of being allocated a register by the coloring algorithm.

Scalar Replacement
Convert array references into scalar references to improve the effectiveness of the coloring-based allocator. Our approach is to use dependences to drive these memory hierarchy transformations.

Dependence and Memory Hierarchy
- True (flow) dependence: saves loads and cache misses
- Antidependence: saves cache misses
- Output dependence: saves stores
- Input dependence: saves loads

  S1: A(I) = ... + B(I)
  S2: ...  = A(I) + k
  S3: A(I) = ...
  S4: ...  = B(I)

(Flow dependence from S1 to S2 on A, antidependence from S2 to S3 on A, output dependence from S1 to S3 on A, input dependence from S1 to S4 on B.)

Dependence and Memory Hierarchy
Among loop-carried dependences, consistent dependences are the most useful for memory management purposes. A consistent dependence is one whose threshold (dependence distance) is constant across iterations.
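For instance (an illustrative snippet, not from the original slides):

  DO I = 1, N
    A(I+2) = A(I) + 1.0      ! consistent: the distance is always 2
    C(2*I) = C(I) + 1.0      ! inconsistent: the distance is I, varying per iteration
  ENDDO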

Dependence and Memory Hierarchy
There is a problem of overcounting optimization opportunities. For example:

  S1: A(I) = ...
  S2: ... = A(I)
  S3: ... = A(I)

Here we can save only two memory references, not three. Solution: prune the edges of the dependence graph that don't correspond to savings in memory accesses.

Using Dependences
In the reduction example:

  DO I = 1, N
    DO J = 1, M
      A(I) = A(I) + B(J)
    ENDDO
  ENDDO

  DO I = 1, N
    T = A(I)
    DO J = 1, M
      T = T + B(J)
    ENDDO
    A(I) = T
  ENDDO

- True dependence: replace the references to A in the inner loop by the scalar T
- Output dependence: the store can be moved outside the inner loop
- Antidependence: the load can be moved before the inner loop

Scalar Replacement
Example: scalar replacement in the case of a loop-independent dependence.

  DO I = 1, N
    A(I) = B(I) + C
    X(I) = A(I)*Q
  ENDDO

  DO I = 1, N
    t = B(I) + C
    A(I) = t
    X(I) = t*Q
  ENDDO

One less load per iteration for the reference to A.

Scalar Replacement
Example: scalar replacement in the case of a loop-carried dependence spanning a single iteration.

  DO I = 1, N
    A(I) = B(I-1)
    B(I) = A(I) + C(I)
  ENDDO

  tB = B(0)
  DO I = 1, N
    tA = tB
    A(I) = tA
    tB = tA + C(I)
    B(I) = tB
  ENDDO

One less load per iteration for the reference to B, which had a loop-carried true dependence spanning one iteration, and also one less load per iteration for the reference to A.

Scalar Replacement
Example: scalar replacement in the case of a loop-carried dependence spanning multiple iterations.

  DO I = 1, N
    A(I) = B(I-1) + B(I+1)
  ENDDO

  t1 = B(0)
  t2 = B(1)
  DO I = 1, N
    t3 = B(I+1)
    A(I) = t1 + t3
    t1 = t2
    t2 = t3
  ENDDO

One less load per iteration for the reference to B, which had a loop-carried input dependence spanning two iterations. The invariants maintained are t1 = B(I-1), t2 = B(I), t3 = B(I+1).

Eliminate Scalar Copies
The copies t1 = t2 and t2 = t3 are unnecessary register-register moves. Since the three temporaries are cycled, unrolling the loop three times lets each unrolled statement read the right temporary directly.

Preloop:

  t1 = B(0)
  t2 = B(1)
  mN3 = MOD(N,3)
  DO I = 1, mN3
    t3 = B(I+1)
    A(I) = t1 + t3
    t1 = t2
    t2 = t3
  ENDDO

Main loop:

  DO I = mN3 + 1, N, 3
    t3 = B(I+1)
    A(I) = t1 + t3
    t1 = B(I+2)
    A(I+1) = t2 + t1
    t2 = B(I+3)
    A(I+2) = t3 + t2
  ENDDO

Pruning the Dependence Graph

  DO I = 1, N
    A(I+1) = A(I-1) + B(I-1)
    A(I) = A(I) + B(I) + B(I+1)
  ENDDO

Not all edges of this loop's dependence graph suggest memory access savings. (In the original slide's figure, the pruned edges are dashed and the generator array references are colored red.) After pruning, each reference has at most one predecessor in the pruned graph.

Pruning the Dependence Graph
Apply scalar replacement after pruning the dependence graph:

  t0A = A(0); t1A0 = A(1); tB1 = B(0); tB2 = B(1)
  DO I = 1, N
    t1A1 = t0A + tB1
    tB3 = B(I+1)
    t0A = t1A0 + tB3 + tB2
    A(I) = t0A
    t1A0 = t1A1
    tB1 = tB2
    tB2 = tB3
  ENDDO
  A(N+1) = t1A1

Only one load and one store per iteration.

Pruning the Dependence Graph
- Prune flow and input dependence edges that do not represent a potential reuse
- Prune redundant input dependence edges
- Prune output dependence edges after the rest of the pruning is done

Pruning the Dependence Graph
Phase 1: Eliminate killed dependences.
When the killed dependence is a flow dependence:

  S1: A(I+1) = ...
  S2: A(I) = ...
  S3: ... = A(I)

The store in S2 is a killing store; the flow dependence from S1 to S3 is pruned.
When the killed dependence is an input dependence:

  S1: ... = A(I+1)
  S2: A(I) = ...
  S3: ... = A(I-1)

The store in S2 is a killing store; the input dependence from S1 to S3 is pruned.

Pruning the Dependence Graph
Phase 2: Identify generators.

  DO I = 1, N
    A(I+1) = A(I-1) + B(I-1)
    A(I) = A(I) + B(I) + B(I+1)
  ENDDO

A generator is:
- any assignment reference with at least one flow dependence emanating from it to another statement in the loop
- any use reference with at least one input dependence emanating from it and no input or flow dependence into it

Pruning the Dependence Graph
Phase 3: Find name partitions and eliminate input dependences. Use typed fusion:
- references are the vertices
- an edge joins two references connected by a dependence
- output dependences and antidependences are bad edges
- the name of the array is the type
Then eliminate input dependences between two elements of the same name partition unless the source is a generator.
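Applied to the running example above (a worked reading of the rule, not spelled out on the slide), typed fusion with the array name as the type yields two name partitions: the A references {A(I+1), A(I-1), A(I)} and the B references {B(I-1), B(I), B(I+1)}.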

Pruning the Dependence Graph
Special case: the reference is in a dependence cycle in the loop.

  DO I = 1, N
    A(J) = B(I) + C(I,J)
    C(I,J) = A(J) + D(I)
  ENDDO

Assign a single scalar to the reference in the cycle: replace A(J) by a scalar tA, and insert A(J) = tA before or after the loop, depending on whether the occurrence is upward or downward exposed.

Pruning the Dependence Graph
Special case: inconsistent dependences.

  DO I = 1, N
    A(I) = A(I-1) + B(I)
    A(J) = A(J) + A(I)
  ENDDO

The store to A(J) kills A(I), so only one scalar replacement is possible:

  DO I = 1, N
    tAI = A(I-1) + B(I)
    A(I) = tAI
    A(J) = A(J) + tAI
  ENDDO

This code can be improved substantially by index set splitting.

Pruning the Dependence Graph

  DO I = 1, N
    tAI = A(I-1) + B(I)
    A(I) = tAI
    A(J) = A(J) + tAI
  ENDDO

Split this loop into three separate parts: a loop up to iteration J, iteration J itself, and a loop from after iteration J up to N.

  tAI = A(0); tAJ = A(J)
  JU = MAX(J-1,0)
  DO I = 1, JU
    tAI = tAI + B(I); A(I) = tAI
    tAJ = tAJ + tAI
  ENDDO
  IF (J.GT.0 .AND. J.LE.N) THEN
    tAI = tAI + B(I); A(I) = tAI
    tAJ = tAJ + tAI
    tAI = tAJ
  ENDIF
  DO I = JU+2, N
    tAI = tAI + B(I); A(I) = tAI
    tAJ = tAJ + tAI
  ENDDO
  A(J) = tAJ

Moderation of Register Pressure
Scalar replacement of all name partitions would produce many scalar quantities competing for the floating-point registers. We therefore have to choose which name partitions to replace so as to make the best use of the available registers.

Moderation of Register Pressure
Attach two parameters to each name partition R:
- v(R), the value of R: the number of loads and stores saved by replacing each reference in R with a register-resident scalar
- c(R), the cost of R: the number of registers needed to hold all the scalar temporary values

Moderation of Register Pressure
Choose a subset {R1, ..., Rm} of the name partitions such that

  c(R1) + ... + c(Rm) <= n,

where n is the number of available registers, and such that

  v(R1) + ... + v(Rm)

is maximized. This is a 0-1 bin packing problem.
- It can be solved with dynamic programming in O(nm) time
- Heuristic: order the name partitions by the ratio v(R)/c(R) and select elements from the beginning of the list until all registers are exhausted
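A small worked illustration with made-up numbers: suppose n = 4 registers and three partitions with (v, c) = (8, 2), (6, 3), and (3, 1). The ratios v/c are 4.0, 2.0, and 3.0, so the heuristic first selects the (8, 2) partition, then the (3, 1) partition; the (6, 3) partition no longer fits in the single remaining register. In total, 11 memory accesses are saved using 3 of the 4 registers.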

Scalar Replacement: Putting It Together
1. Prune the dependence graph; apply typed fusion.
2. Select a set of name partitions using register pressure moderation.
3. For each selected partition:
   a) If non-cyclic, replace using a set of temporaries.
   b) If cyclic, replace the reference with a single temporary.
   c) For each inconsistent dependence, use index set splitting or insert loads and stores.
4. Unroll the loop to eliminate scalar copies.

Scalar Replacement: Case A

  DO I = 1, N
    A(I+1) = A(I-1) + B(I-1)
    A(I) = A(I) + B(I) + B(I+1)
  ENDDO

  t0A = A(0); t1A0 = A(1); tB1 = B(0); tB2 = B(1)
  DO I = 1, N
    t1A1 = t0A + tB1
    tB3 = B(I+1)
    t0A = t1A0 + tB3 + tB2
    A(I) = t0A
    t1A0 = t1A1
    tB1 = tB2
    tB2 = tB3
  ENDDO
  A(N+1) = t1A1

Scalar Replacement: Case B

  DO I = 1, N
    A(J) = B(I) + C(I,J)
    C(I,J) = A(J) + D(I)
  ENDDO

Replace with a single temporary:

  DO I = 1, N
    tA = B(I) + C(I,J)
    C(I,J) = tA + D(I)
  ENDDO
  A(J) = tA

Scalar Replacement: Case C

  DO I = 1, N
    tAI = A(I-1) + B(I)
    A(I) = tAI
    A(J) = A(J) + tAI
  ENDDO

Split this loop into three separate parts: a loop up to iteration J, iteration J itself, and a loop from after iteration J up to N.

  tAI = A(0); tAJ = A(J)
  JU = MAX(J-1,0)
  DO I = 1, JU
    tAI = tAI + B(I); A(I) = tAI
    tAJ = tAJ + tAI
  ENDDO
  IF (J.GT.0 .AND. J.LE.N) THEN
    tAI = tAI + B(I); A(I) = tAI
    tAJ = tAJ + tAI
    tAI = tAJ
  ENDIF
  DO I = JU+2, N
    tAI = tAI + B(I); A(I) = tAI
    tAJ = tAJ + tAI
  ENDDO
  A(J) = tAJ

Experiments on Scalar Replacement (the results charts are not reproduced in this transcript)


Unroll-and-Jam

  DO I = 1, N*2
    DO J = 1, M
      A(I) = A(I) + B(J)
    ENDDO
  ENDDO

Can we achieve reuse of the references to B? Use a transformation called unroll-and-jam: unroll the outer loop twice, then fuse the two copies of the inner loop.

  DO I = 1, N*2, 2
    DO J = 1, M
      A(I) = A(I) + B(J)
      A(I+1) = A(I+1) + B(J)
    ENDDO
  ENDDO

This brings two uses of B(J) together.

Unroll-and-Jam
Apply scalar replacement to this code:

  DO I = 1, N*2, 2
    s0 = A(I)
    s1 = A(I+1)
    DO J = 1, M
      t = B(J)
      s0 = s0 + t
      s1 = s1 + t
    ENDDO
    A(I) = s0
    A(I+1) = s1
  ENDDO

Half the number of loads of the original program.
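To count the savings (presumably against the scalar-replaced but unjammed loop): before, B(J) was loaded on every one of the 2N*M inner iterations; now it is loaded M times for each pair of I values, i.e., about N*M times in total, plus 2N loads and stores of A at the loop boundaries. That is roughly half as many loads.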

Legality of Unroll-and-Jam
Is unroll-and-jam always legal?

  DO I = 1, N*2
    DO J = 1, M
      A(I+1,J-1) = A(I,J) + B(I,J)
    ENDDO
  ENDDO

Apply unroll-and-jam:

  DO I = 1, N*2, 2
    DO J = 1, M
      A(I+1,J-1) = A(I,J) + B(I,J)
      A(I+2,J-1) = A(I+1,J) + B(I+1,J)
    ENDDO
  ENDDO

This is wrong! The second statement now reads A(I+1,J) at inner iteration J, before the first statement writes that element at inner iteration J+1; in the original loop, the write (during outer iteration I) preceded the read (during outer iteration I+1).


Legality of Unroll-and-Jam
The direction vector in this example was (<, >), which makes loop interchange illegal. Unroll-and-jam is loop interchange, followed by unrolling of the inner loop, followed by another loop interchange. But does loop interchange being illegal imply that unroll-and-jam is illegal? No.
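To make the three-step decomposition concrete, here is the (legal) motivating example traced through the steps; these intermediate loops are a sketch, not shown on the slides:

  ! Step 1: interchange the I and J loops
  DO J = 1, M
    DO I = 1, N*2
      A(I) = A(I) + B(J)
    ENDDO
  ENDDO

  ! Step 2: unroll the (now inner) I loop by 2
  DO J = 1, M
    DO I = 1, N*2, 2
      A(I) = A(I) + B(J)
      A(I+1) = A(I+1) + B(J)
    ENDDO
  ENDDO

  ! Step 3: interchange back
  DO I = 1, N*2, 2
    DO J = 1, M
      A(I) = A(I) + B(J)
      A(I+1) = A(I+1) + B(J)
    ENDDO
  ENDDO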

Legality of Unroll-and-Jam
Consider this example:

  DO I = 1, N*2
    DO J = 1, M
      A(I+2,J-1) = A(I,J) + B(I,J)
    ENDDO
  ENDDO

The direction vector is again (<, >), but unroll-and-jam is still possible: the dependence distance in the outer loop is 2.

Conditions for Legality of Unroll-and-Jam
Definition: Unroll-and-jam to factor n consists of unrolling the outer loop n-1 times and fusing the resulting copies of the inner loop together.
Theorem: An unroll-and-jam to a factor of n is legal if and only if there exists no dependence with direction vector (<, >) whose distance in the outer loop is less than n.
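Applying the theorem to the two preceding examples: the first had a (<, >) dependence with an outer-loop distance of 1, so no factor n >= 2 is legal; the second had an outer-loop distance of 2, so unroll-and-jam to factor 2 is legal, while factor 3 would not be.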

Unroll-and-Jam Algorithm
1. Create the preloop.
2. Unroll the main loop m times (m is the unroll-and-jam factor).
3. Apply typed fusion to the loops within the body of the unrolled loop.
4. Apply unroll-and-jam recursively to the inner nested loops.

Unroll-and-Jam Example

  DO I = 1, N
    DO K = 1, N
      A(I) = A(I) + X(I,K)
    ENDDO
    DO J = 1, M
      DO K = 1, N
        B(J,K) = B(J,K) + A(I)
      ENDDO
    ENDDO
    DO J = 1, M
      C(J,I) = B(J,N)/A(I)
    ENDDO
  ENDDO

After unroll-and-jam to factor 2, with the two J loops fused by typed fusion (mN2 = MOD(N,2); the preloop covering I = 1 to mN2 is not shown):

  DO I = mN2+1, N, 2
    DO K = 1, N
      A(I) = A(I) + X(I,K)
      A(I+1) = A(I+1) + X(I+1,K)
    ENDDO
    DO J = 1, M
      DO K = 1, N
        B(J,K) = B(J,K) + A(I)
        B(J,K) = B(J,K) + A(I+1)
      ENDDO
      C(J,I) = B(J,N)/A(I)
      C(J,I+1) = B(J,N)/A(I+1)
    ENDDO
  ENDDO

Unroll-and-Jam: Experiments (the results charts are not reproduced in this transcript)


Conclusion
We have studied two memory hierarchy transformations:
- scalar replacement
- unroll-and-jam
Both reduce the number of memory accesses by making maximum use of the processor registers.