Loop transformations
Simone Campanoni simonec@eecs.northwestern.edu

Outline
- Loop invariants
- Induction variables
- Loop transformations

Optimizations in small, hot loops
Most programs spend about 90% of their time in a few small, hot loops:

  while (...) {
    statement 1
    statement 2
    statement 3
  }

Deleting a single statement from a small, hot loop might have a big impact (e.g., 100 seconds -> 70 seconds).

Loop example

  1:  if (N > 5) { k = 1; z = 4; }
  2:  else       { k = 2; z = 3; }
      do {
  3:    a = 1;
  4:    y = x + N;
  5:    b = k + z;
  6:    c = a * 3;
  7:    if (N < 0) {
  8:      m = 5;
  9:      break;
        }
  10:   x++;
  11: } while (x < N);

Observation: each statement in that loop contributes to the program execution time.
Idea: what about moving statements from inside the loop to outside it?
- Which statements can be moved outside our loop?
- How do we identify them automatically? (code analysis)
- How do we move them? (code transformation)

Loop-invariant computations
Let d be the following definition:

  (d) t = x op y

d is a loop-invariant of a loop L if (assuming x and y do not escape):
- x and y are constants, or
- all reaching definitions of x and y are outside the loop, or
- only one definition reaches x (or y), and that definition is loop-invariant.

Loop example

  1:  if (N > 5) { k = 1; z = 4; }
  2:  else       { k = 2; z = 3; }
      do {
  3:    a = 1;
  4:    y = x + N;
  5:    b = k + z;
  6:    c = a * 3;
  7:    if (N < 0) {
  8:      m = 5;
  9:      break;
        }
  10:   x++;
  11: } while (x < N);

d is a loop-invariant of a loop L if:
- x and y are constants, or
- all reaching definitions of x and y are outside the loop, or
- only one definition reaches x (or y), and that definition is loop-invariant.
Which statements of this loop satisfy these conditions?

Loop-invariant computations in LLVM
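(The slide shows how to find these in LLVM. As a rough illustration only, not the slide's code: LLVM's Loop class can already answer the operand test above, since Loop::isLoopInvariant returns true for any value defined outside the loop. The helper name findLoopInvariants below is made up for this sketch.)

  #include "llvm/ADT/SmallVector.h"
  #include "llvm/Analysis/LoopInfo.h"
  #include "llvm/IR/Instructions.h"
  using namespace llvm;

  // Collect the instructions of L whose operands are all loop-invariant.
  // One scan only finds "first level" invariants; repeating the scan to a
  // fixed point also catches computations that depend on other invariants.
  static void findLoopInvariants(Loop *L, SmallVectorImpl<Instruction *> &Out) {
    for (BasicBlock *BB : L->getBlocks()) {
      for (Instruction &I : *BB) {
        // Be conservative: only pure, movable computations.
        if (I.mayHaveSideEffects() || I.mayReadFromMemory() || I.isTerminator())
          continue;
        bool AllInvariant = true;
        for (Value *Op : I.operands()) {
          if (!L->isLoopInvariant(Op)) {
            AllInvariant = false;
            break;
          }
        }
        if (AllInvariant)
          Out.push_back(&I);
      }
    }
  }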

Hoisting code
In order to "hoist" a loop-invariant computation out of a loop, we need a place to put it.
We could copy it to all immediate predecessors of the loop header...

  for (auto it = pred_begin(H); it != pred_end(H); ++it) {
    BasicBlock *pBB = *it;
    Instruction *p = pBB->getTerminator();
    inv->moveBefore(p);
  }

[Figure: basic blocks n1, n2, n3 all branching to the loop header]
Is it correct? (No: one predecessor of the header is the loop's own latch, so this would move the computation right back inside the loop.)

Hoisting code
...but we can avoid code duplication (and bugs) by taking advantage of loop normalization, which guarantees the existence of the pre-header:

  BasicBlock *pBB = loop->getLoopPreheader();
  Instruction *p = pBB->getTerminator();
  inv->moveBefore(p);

[Figure: normalized loop shape: pre-header -> header -> body -> latch, plus an exit node]

Hoisting conditions (loop-invariant code motion)
For a loop-invariant definition

  (d) t = x op y

we can hoist d into the loop's pre-header if:
1. d's block dominates all loop exits at which t is live-out, and
2. there is only one definition of t in the loop, and
3. t is not live-out of the pre-header.
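(Condition 1 can be checked with LLVM's dominator tree; a minimal sketch, assuming a DominatorTree is available. checkDominatesExits is a made-up helper name, and it conservatively checks every exit rather than only the exits where t is live-out.)

  #include "llvm/ADT/SmallVector.h"
  #include "llvm/Analysis/LoopInfo.h"
  #include "llvm/IR/Dominators.h"
  using namespace llvm;

  // Condition 1: the block defining the invariant must dominate the exits.
  static bool checkDominatesExits(Instruction *Inv, Loop *L, DominatorTree &DT) {
    SmallVector<BasicBlock *, 8> Exits;
    L->getExitBlocks(Exits);
    for (BasicBlock *Exit : Exits)
      if (!DT.dominates(Inv->getParent(), Exit))
        return false;
    return true;
  }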

Outline
- Loop invariants
- Induction variables
- Loop transformations

Loop example
Assuming a, b, c, m are used after our code:

  1:  if (N > 5) { k = 1; z = 4; }
  2:  else       { k = 2; z = 3; }
      do {
  3:    a = 1;
  4:    y = x + N;
  5:    b = k + z;
  6:    c = a * 3;
  7:    if (N < 0) {
  8:      m = 5;
  9:      break;
        }
  10:   x++;
  11: } while (x < N);

Do we have to execute 4 for every iteration?
Do we have to execute 10 for every iteration?

Loop example

  1:  if (N > 5) { k = 1; z = 4; }
  2:  else       { k = 2; z = 3; }
      do {
  3:    a = 1;
  4:    y = x + N;
  5:    b = k + z;
  6:    c = a * 3;
  7:    if (N < 0) {
  8:      m = 5;
  9:      break;
        }
  10:   x++;
  11: } while (x < N);

Do we have to execute 4 for every iteration?
Compute manually the values of x and y for every iteration. What do you see?
(x starts at 0 and increases by 1 per iteration; y is always x + N, so y starts at N and also increases by 1 per iteration.)
Do we have to execute 10 for every iteration?

Loop example
Statement 4 is gone: y is initialized to N before the loop and incremented alongside x in statement 10.

  1:  if (N > 5) { k = 1; z = 4; }
  2:  else       { k = 2; z = 3; }
      y = N;
      do {
  3:    a = 1;
  5:    b = k + z;
  6:    c = a * 3;
  7:    if (N < 0) {
  8:      m = 5;
  9:      break;
        }
  10:   x++; y++;
  11: } while (x < N);

Do we have to execute 10 for every iteration?

Loop example
The exit test can now use y instead of x:

  1:  if (N > 5) { k = 1; z = 4; }
  2:  else       { k = 2; z = 3; }
      y = N;
      do {
  3:    a = 1;
  5:    b = k + z;
  6:    c = a * 3;
  7:    if (N < 0) {
  8:      m = 5;
  9:      break;
        }
  10:   x++; y++;
  11: } while (y < (2*N));

Do we have to execute 10 for every iteration?

Loop example
With the exit test on y, x is no longer needed at all:

  1:  if (N > 5) { k = 1; z = 4; }
  2:  else       { k = 2; z = 3; }
      y = N;
      do {
  3:    a = 1;
  5:    b = k + z;
  6:    c = a * 3;
  7:    if (N < 0) {
  8:      m = 5;
  9:      break;
        }
  10:   y++;
  11: } while (y < (2*N));

Do we have to execute 10 for every iteration?

Loop example
Finally, the loop-invariant value 2*N is computed once, before the loop:

  1:  if (N > 5) { k = 1; z = 4; }
  2:  else       { k = 2; z = 3; }
      y = N; tmp = 2*N;
      do {
  3:    a = 1;
  5:    b = k + z;
  6:    c = a * 3;
  7:    if (N < 0) {
  8:      m = 5;
  9:      break;
        }
  10:   y++;
  11: } while (y < tmp);

x and y are induction variables.

Is the code transformation worth it?

Before:
  1:  if (N > 5) { k = 1; z = 4; }
  2:  else       { k = 2; z = 3; }
      do {
  3:    a = 1;
  4:    y = x + N;
  5:    b = k + z;
  6:    c = a * 3;
  7:    if (N < 0) {
  8:      m = 5;
  9:      break;
        }
  10:   x++;
  11: } while (x < N);

After induction variable elimination:
  1:  if (N > 5) { k = 1; z = 4; }
  2:  else       { k = 2; z = 3; }
  A:  y = N; tmp = 2*N;
      do {
  3:    a = 1;
  5:    b = k + z;
  6:    c = a * 3;
  7:    if (N < 0) {
  8:      m = 5;
  9:      break;
        }
  10:   y++;
  11: } while (y < tmp);

… and after Loop Invariant Code Motion ...

Before:
  1:  if (N > 5) { k = 1; z = 4; }
  2:  else       { k = 2; z = 3; }
      do {
  3:    a = 1;
  4:    y = x + N;
  5:    b = k + z;
  6:    c = a * 3;
  7:    if (N < 0) {
  8:      m = 5;
  9:      break;
        }
  10:   x++;
  11: } while (x < N);

After:
  1:  if (N > 5) { k = 1; z = 4; }
  2:  else       { k = 2; z = 3; }
  A:  y = N; tmp = 2*N;
  3:  a = 1;
  5:  b = k + z;
  6:  c = a * 3;
      do {
  7:    if (N < 0) {
  8:      m = 5;
  9:      break;
        }
  10:   y++;
  11: } while (y < tmp);

… and with a better Loop Invariant Code Motion ...

Before:
  1:  if (N > 5) { k = 1; z = 4; }
  2:  else       { k = 2; z = 3; }
      do {
  3:    a = 1;
  4:    y = x + N;
  5:    b = k + z;
  6:    c = a * 3;
  7:    if (N < 0) {
  8:      m = 5;
  9:      break;
        }
  10:   x++;
  11: } while (x < N);

After (the conditional assignment to m is hoisted too, and the break disappears):
  1:  if (N > 5) { k = 1; z = 4; }
  2:  else       { k = 2; z = 3; }
  A:  y = N; tmp = 2*N;
  3:  a = 1;
  5:  b = k + z;
  6:  c = a * 3;
  7:  if (N < 0) {
  8:    m = 5;
      }
      do {
  10:   y++;
  11: } while (y < tmp);

… and after dead code elimination ...
Assuming a, b, c, m are used after our code.

Before:
  1:  if (N > 5) { k = 1; z = 4; }
  2:  else       { k = 2; z = 3; }
      do {
  3:    a = 1;
  4:    y = x + N;
  5:    b = k + z;
  6:    c = a * 3;
  7:    if (N < 0) {
  8:      m = 5;
  9:      break;
        }
  10:   x++;
  11: } while (x < N);

After (the loop itself is now dead and disappears):
  1:  if (N > 5) { k = 1; z = 4; }
  2:  else       { k = 2; z = 3; }
  3:  a = 1;
  5:  b = k + z;
  6:  c = a * 3;
  7:  if (N < 0) {
  8:    m = 5;
      }

Induction variable observation
Some variables change by a constant amount on each loop iteration:
- x is initialized to 0 and increments by 1
- y is initialized to N and increments by 1
These are all induction variables. How can we identify them?

Induction variable elimination
Suppose we have a loop variable

  i  (initially set to i0; each iteration i = i + 1)

and a variable that linearly depends on it

  x = i * c1 + c2   (c1 and c2 are loop-invariant constants)

We can then:
- initialize x = i0 * c1 + c2, and
- increment x by c1 on each iteration.
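(A concrete sketch in C, with made-up constants c1 = 12, c2 = 8, i0 = 0 and a hypothetical use() consumer:)

  extern void use(int);

  void before(int n) {
    // x is recomputed from i with a multiply on every iteration.
    for (int i = 0; i < n; i++) {
      int x = i * 12 + 8;   // x = i*c1 + c2
      use(x);
    }
  }

  void after_iv_elimination(int n) {
    // x is initialized once (x = i0*c1 + c2) and then just incremented.
    int x = 0 * 12 + 8;
    for (int i = 0; i < n; i++) {
      use(x);
      x += 12;              // increment by c1 each iteration
    }
  }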

Is it faster?
On some hardware, adds are much faster than multiplies (this is strength reduction).
Furthermore, one fewer value is computed, thus potentially saving a register and decreasing the possibility of spilling.

Induction variables
- Basic induction variables: i = i op c, where c is loop-invariant (a.k.a. independent induction variables)
- Derived induction variables: j = i * c1 + c2 (a.k.a. dependent induction variables)

Identify induction variables: step 1
Find the basic IVs:
- Scan the loop body for defs of the form x = x + c, where c is loop-invariant
- Record these basic IVs as x = (x, 1, c); this represents the IV x = x * 1 + c
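(In LLVM's SSA form, a basic IV shows up as a header phi whose back-edge value adds a loop-invariant step to the phi itself. A minimal sketch of this scan, assuming the loop has a single latch; findBasicIVs is a made-up name:)

  #include "llvm/Analysis/LoopInfo.h"
  #include "llvm/IR/Instructions.h"
  #include "llvm/IR/PatternMatch.h"
  using namespace llvm;
  using namespace llvm::PatternMatch;

  static void findBasicIVs(Loop *L) {
    BasicBlock *Latch = L->getLoopLatch();
    if (!Latch)
      return; // more than one latch: skip this loop
    for (PHINode &Phi : L->getHeader()->phis()) {
      Value *Next = Phi.getIncomingValueForBlock(Latch);
      Value *Step;
      // Match "next = phi + step" (either operand order) with a
      // loop-invariant step.
      if (match(Next, m_c_Add(m_Specific(&Phi), m_Value(Step))) &&
          L->isLoopInvariant(Step)) {
        // record Phi as the basic IV (Phi, 1, Step)
      }
    }
  }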

Identify induction variables: step 2
Find the derived IVs:
- Scan for defs of the form k = i * c1 + c2, where i is a basic IV and this is the only definition of k in the loop
- Record as k = (i, c1, c2); we say k is in the family of i

Identified induction variables
- i: basic
- j: basic
- q: derived from i
- k: derived from i
- x: derived from j
- z: derived from k
A forest of induction variables.

Many optimizations rely on IVs
For example, the induction variable elimination we have seen before.

Induction variable elimination: step 1
Iterate over the derived IVs k = j * c1 + c2 where
- j = (i, a, b) is an IV,
- this is the only def of k in the loop, and
- there is no def of i between the def of j and the def of k:

  i = ...
  ...
  j = i ...
  ...
  k = j ...

Record k as k = (i, a*c1, b*c1 + c2).

Induction variable elimination: step 2
For an induction variable k = (i, c1, c2):
- Initialize k = i * c1 + c2 in the pre-header
- Replace k's def in the loop by k = k + c1
- Make sure to do this after i's definition

Normalize induction variables
Basic induction variables are normalized to:
- initial value 0,
- incremented by 1 at each iteration,
- an explicit final value (either constant or loop-invariant).
That is, x = x*1 + 1, recorded as (x, 1, 1).

Normalize induction variables in LLVM
- indvars: normalizes induction variables and highlights the basic induction variables
- scalar-evolution: scalar evolution analysis
  - Represents scalar expressions (e.g., x = y op z)
  - Supports induction variables (e.g., x = x + 1)
  - Lowers the burden of explicitly handling the composition of expressions

LLVM scalar evolution example
SCEV: {A, B, C}<flag>*<%D>
A: initial value; B: operator; C: operand (the step); D: the basic block where it gets defined
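(For instance, for a canonical counter starting at 0 and stepping by 1, the scalar evolution printer, e.g. opt -analyze -scalar-evolution with the legacy pass manager, reports something of this shape; the exact flags and block name depend on the IR:)

  %i = phi i64 [ 0, %entry ], [ %i.next, %body ]
      -->  {0,+,1}<nuw><nsw><%body>
      ; A = 0 (initial value), B = + (operator), C = 1 (step), D = %body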

LLVM scalar evolution example: pass deps
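(One way to express this dependence, with the legacy pass manager; MyPass is a placeholder name:)

  #include "llvm/Analysis/ScalarEvolution.h"
  #include "llvm/Pass.h"
  using namespace llvm;

  namespace {
  struct MyPass : FunctionPass {
    static char ID;
    MyPass() : FunctionPass(ID) {}

    void getAnalysisUsage(AnalysisUsage &AU) const override {
      AU.addRequired<ScalarEvolutionWrapperPass>(); // the pass dependence
      AU.setPreservesAll();
    }

    bool runOnFunction(Function &F) override {
      ScalarEvolution &SE = getAnalysis<ScalarEvolutionWrapperPass>().getSE();
      (void)SE; // ... query SCEVs of the values in F here ...
      return false;
    }
  };
  }
  char MyPass::ID = 0;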

A use of the Scalar Evolution pass
Problem: we need to infer the number of iterations that a loop will execute.

  for (int i = 0; i < 10; i++){
    BODY
  }

Solution for this loop: 10. Any idea how to compute it in general?
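(Scalar evolution answers exactly this question. A sketch, assuming a ScalarEvolution &SE as in the pass above and a Loop *L:)

  // Constant trip count if SCEV can prove one (returns 0 for "unknown").
  unsigned Trip = SE.getSmallConstantTripCount(L);

  // More generally, the backedge-taken count as a symbolic SCEV
  // (the trip count is this value + 1 whenever the loop is entered).
  const SCEV *BTC = SE.getBackedgeTakenCount(L);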

Scalar evolution in LLVM
Analysis used by:
- induction variable substitution
- strength reduction
- vectorization
- ...
SCEVs are modeled by the llvm::SCEV class:
- There is a sub-class for each kind of SCEV (e.g., llvm::SCEVAddExpr)
- A SCEV is a tree of SCEVs
- Leaves: constants (llvm::SCEVConstant, e.g., 1) and unknowns (llvm::SCEVUnknown)
- To iterate over a tree: llvm::SCEVVisitor

Outline
- Loop invariants
- Induction variables
- Loop transformations

Loop transformations
- Restructure a loop to expose more optimization opportunities and/or reduce the "loop overhead": loop unrolling, loop peeling, loop splitting, ...
- Reorganize a loop to improve memory utilization: cache blocking, skewing, loop reversal
- Distribute a loop over cores/processors: DOACROSS, DOALL, DSWP, HELIX

Loop unrolling
Unrolling factor: 2

Before:
  loop:
    Body
    %a = add %a, 1
    %c = cmp %a, 10
    branch %c loop

After:
  loop:
    Body
    Body            ; second copy of the body
    %a = add %a, 2
    %c = cmp %a, 10
    branch %c loop
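(In C terms, a factor-2 unroll looks like the sketch below; the second loop handles the remainder iteration when n is odd:)

  extern void body(int);

  void before(int n) {
    for (int i = 0; i < n; i++)   // one body per test-and-branch
      body(i);
  }

  void unrolled_by_2(int n) {
    int i = 0;
    for (; i + 1 < n; i += 2) {   // two bodies per test-and-branch
      body(i);
      body(i + 1);
    }
    for (; i < n; i++)            // remainder (at most one iteration)
      body(i);
  }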

Loop peeling
Peeling factor: 1

Before:
  loop:
    Body
    %a = add %a, 1
    %c = cmp %a, 10
    branch %c loop

After (the first iteration is peeled out and executed before the loop):
  Body
  %a = add %a, 1
  loop:
    Body
    %a = add %a, 1
    %c = cmp %a, 10
    branch %c loop
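(Again in C terms, peeling one iteration executes it before the loop; the remaining loop body can then assume at least one iteration already ran:)

  extern void body(int);

  void peeled_by_1(int n) {
    if (n > 0)
      body(0);                    // the peeled first iteration
    for (int i = 1; i < n; i++)   // the remaining iterations
      body(i);
  }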

Loop transformations for memory optimizations
Performance goal: reduce program clock cycles.
How many clock cycles will this take?

  ...
  varX = array[5];
  ...

(It depends: is array[5] in a register, in the cache, or in main memory?)

Goal: improve cache performance
To enhance:
- Temporal locality: a resource that has just been referenced will more likely be referenced again in the near future
- Spatial locality: the likelihood of referencing a resource is higher if a resource near it was just referenced
What to minimize: bad replacement decisions.

What a compiler can do
- Time: when is an object accessed?
- Space: where does an object exist in the address space?
These are the two "knobs" a compiler can manipulate.

Manipulating time and space
- Time: reordering computation. Determine when an object will be accessed, and predict a better time to access it.
- Space: changing data layout. Determine an object's shape and location, and determine a better layout.

First understand cache behavior ...
- When do cache misses occur? Use locality analysis
- Can we change the visitation order to produce better behavior? Evaluate costs
- Does the new visitation order still produce correct results? Use dependence analysis

… and then rely on loop transformations
- loop interchange
- cache blocking
- loop fusion
- loop reversal
- ...

Code example

  double A[N][N], B[N][N];
  ...
  for i = 0 to N-1 {
    for j = 0 to N-1 {
      ... = A[i][j] ...
    }
  }

[Figure: the iteration space for A]

Loop interchange
Assumptions: N is large; A is row-major; 2 elements per cache line.
How is A[][] laid out in C? In Java?

Before:
  for i = 0 to N-1
    for j = 0 to N-1
      ... = A[j][i] ...

After:
  for j = 0 to N-1
    for i = 0 to N-1
      ... = A[j][i] ...
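(A C sketch of the effect; since C stores A in row-major order, the interchanged version walks memory contiguously:)

  #define N 4096
  double A[N][N];

  double before(void) {
    double sum = 0;
    // Inner loop walks down a column: consecutive accesses are N
    // elements apart, so (almost) every access touches a new cache line.
    for (int i = 0; i < N; i++)
      for (int j = 0; j < N; j++)
        sum += A[j][i];
    return sum;
  }

  double after_interchange(void) {
    double sum = 0;
    // Inner loop walks along a row: consecutive accesses share cache lines.
    for (int j = 0; j < N; j++)
      for (int i = 0; i < N; i++)
        sum += A[j][i];
    return sum;
  }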

Cache blocking (a.k.a. tiling)

Before:
  for i = 0 to N-1
    for j = 0 to N-1
      f(A[i], A[j])

After (blocking the j loop into tiles of size B):
  for JJ = 0 to N-1 by B
    for i = 0 to N-1
      for j = JJ to min(N-1, JJ+B-1)
        f(A[i], A[j])
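(The same in C; the tile size B is a tuning parameter chosen so a tile of A stays resident in the cache, and 64 below is only illustrative:)

  #define N 4096
  #define B 64                      // illustrative tile size

  extern void f(double, double);
  double A[N];

  void pairwise_blocked(void) {
    // Each tile A[JJ .. JJ+B-1] is reused by the whole i loop while hot.
    for (int JJ = 0; JJ < N; JJ += B)
      for (int i = 0; i < N; i++)
        for (int j = JJ; j < N && j < JJ + B; j++)
          f(A[i], A[j]);
  }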

Loop fusion

  for i = 0 to N-1
    C[i] = A[i]*2 + B[i]
    D[i] = A[i]*2

(the result of fusing two separate i-loops, one computing C and one computing D)
- Reduce loop overhead
- Improve locality by combining loops that reference the same array
- Increase the granularity of work done in a loop
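(A C sketch; fusion is legal here because neither loop writes anything the other one reads:)

  #define N 4096
  double A[N], B[N], C[N], D[N];

  void before(void) {
    for (int i = 0; i < N; i++)   // A[] traversed twice,
      C[i] = A[i] * 2 + B[i];
    for (int i = 0; i < N; i++)   // two sets of loop overhead
      D[i] = A[i] * 2;
  }

  void after_fusion(void) {
    for (int i = 0; i < N; i++) { // one traversal, one overhead;
      double t = A[i] * 2;        // A[i]*2 is now also shared (CSE)
      C[i] = t + B[i];
      D[i] = t;
    }
  }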

Locality analysis
- Reuse: accessing a location that has been accessed previously
- Locality: accessing a location that is in the cache
Observe: locality only occurs when there is reuse... but reuse does not imply locality.

Steps in locality analysis
1. Find data reuse
2. Determine the "localized iteration space": the set of inner loops where the data accessed by an iteration is expected to fit within the cache
3. Find data locality: reuse ∩ localized iteration space ⇒ locality