Operator Strength Reduction
From Cooper, Simpson, & Vick, "Operator Strength Reduction," ACM TOPLAS, 23(5), 2001. See also § 10.7.2 of EaC2e. COMP 512, Rice University.


Operator Strength Reduction
From Cooper, Simpson, & Vick, "Operator Strength Reduction," ACM TOPLAS, 23(5), 2001. See also § 10.7.2 of EaC2e.
Copyright 2011, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in Comp 512 at Rice University have explicit permission to make copies of these materials for their personal use. Faculty from other educational institutions may use these materials for nonprofit educational purposes, provided this copyright notice is preserved.
Comp 512, Spring 2011

Operator Strength Reduction

Consider the following simple loop:

    sum = 0
    do i = 1 to 100
        sum = sum + a(i)
    end do

What's wrong with this picture?
- It takes 3 operations to compute the address of a(i)
- On most machines, the integer multiply (the one that defines r_2) is slow

    loadI  0             => r_sum
    loadI  1             => r_i
    loadI  100           => r_100
loop: subI   r_i, 1      => r_1      // (i - 1)
    multI  r_1, 4        => r_2      //   * 4 bytes per element
    addI   r_2, @a       => r_3      //   + @a: address of a(i)
    load   r_3           => r_4
    add    r_4, r_sum    => r_sum
    addI   r_i, 1        => r_i
    cmp_LT r_i, r_100    => r_5
    cbr    r_5           => loop, exit
exit: ...

Operator Strength Reduction

Consider the value sequences that the temporaries take on as the loop (from the previous slide) runs:

    r_i = { 1, 2, 3, 4, ... }
    r_1 = { 0, 1, 2, 3, ... }
    r_2 = { 0, 4, 8, 12, ... }
    r_3 = { @a, @a+4, @a+8, @a+12, ... }

The only one we care about is r_3, and we can compute it directly & cheaply.

Operator Strength Reduction

Computing r_3 directly yields:

    loadI  0             => r_sum
    loadI  1             => r_i
    loadI  100           => r_100
    loadI  @a            => r_3
loop: load   r_3         => r_4      // address of a(i)
    addI   r_3, 4        => r_3
    add    r_4, r_sum    => r_sum
    addI   r_i, 1        => r_i
    cmp_LT r_i, r_100    => r_5
    cbr    r_5           => loop, exit
exit: ...

- From 8 operations in the loop down to 6
- No expensive multiply, just cheap adds
- r_3 still takes on the sequence { @a, @a+4, @a+8, @a+12, ... }

Still, we can do better...
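The equivalence of the two address computations can be checked concretely. Here is a minimal Python sketch; the base address `a` is an illustrative value, not something from the slides:

```python
a = 1000   # hypothetical base address of the array a (illustrative)
WORD = 4   # bytes per element, as in the ILOC example

# Original loop body: three ops per iteration, (i - 1) * 4 + a
addrs_original = [(i - 1) * WORD + a for i in range(1, 101)]

# Reduced loop body: r3 starts at a and is bumped by 4 each iteration
addrs_reduced = []
r3 = a
for _ in range(100):
    addrs_reduced.append(r3)
    r3 += WORD

# Both forms visit exactly the same address sequence
assert addrs_original == addrs_reduced
```

The final address is a + 396, which is why the limit 396 appears on the next slide.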

Operator Strength Reduction

Changing the loop's exit test to use r_3 yields:

    loadI  0             => r_sum
    loadI  @a            => r_3
    addI   r_3, 396      => r_lim
loop: load   r_3         => r_4
    addI   r_3, 4        => r_3
    add    r_4, r_sum    => r_sum
    cmp_LT r_3, r_lim    => r_5
    cbr    r_5           => loop, exit
exit: ...

- The address computation went from -, +, * to a single +
- The exit test went from +, cmp to cmp
- The loop body went from 8 operations to 5, a 37.5% reduction in the ops in the loop
- Got rid of that expensive multiply, too

A pretty good speedup for most machines.

Operator Strength Reduction

Definition: Operator strength reduction is a transformation that replaces a strong (expensive) operator with a weaker (cheaper) operator.
- Strong form: replace a series of multiplies with adds
- Weak form: replace a single multiply with shifts and adds (see, for example, R. Bernstein, "Multiplication by Integer Constants," Software -- Practice and Experience, 16(7), July 1986)

The problem:
- It's easy to see the transformation
- It's somewhat harder to automate the process

Today's lecture presents an algorithm for the strong form of operator strength reduction.
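To make the weak form concrete, here is a hedged Python sketch of the simplest shift-and-add decomposition, one add per set bit of the constant; the function name is ours, and real code generators (including Bernstein's) use more sophisticated factorizations:

```python
def shift_add_multiply(x: int, c: int) -> int:
    """Compute x * c for a non-negative constant c using only
    shifts and adds: one add per set bit of c."""
    result = 0
    shift = 0
    while c:
        if c & 1:
            result += x << shift  # add the term x * 2**shift
        c >>= 1
        shift += 1
    return result
```

For example, c = 10 = 0b1010 turns x * 10 into (x << 1) + (x << 3): two shifts and one add, in place of a multiply.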

Operator Strength Reduction

Assumptions:
- Low-level IR, such as ILOC, converted into SSA form
- Interpret the SSA form as a graph

Terminology:
- A strongly connected component (SCC) of a directed graph is a region where a path exists from each node to every other node
- A region constant (RC) of an SCC is a value that is invariant in the SCC
- An induction variable (IV) of an SCC is one whose value changes in the SCC only when operations increment it by an RC or an IV, or when it is the destination of a COPY from another IV
- A candidate for reduction is an operation "x <- y * z" where y, z ∈ IV ∪ RC and either y ∈ IV or z ∈ IV

* Intuitively, we are interested in induction variables that are updated in a cyclic fashion; this repetition creates the opportunity from which OSR derives its benefits. The classic papers, however, define IVs this way. As you will see, our algorithm only finds IVs that form a cycle in the SSA graph.

Operator Strength Reduction

The code in semi-pruned SSA form:

    loadI  0               => r_s0
    loadI  1               => r_i0
    loadI  100             => r_100
loop: phi    r_s0, r_s2    => r_s1
    phi    r_i0, r_i2      => r_i1
    subI   r_i1, 1         => r_1
    multI  r_1, 4          => r_2
    addI   r_2, @a         => r_3
    load   r_3             => r_4
    add    r_4, r_s1       => r_s2
    addI   r_i1, 1         => r_i2
    cmp_LT r_i2, r_100     => r_5
    cbr    r_5             => loop, exit
exit: ...

r_1 through r_5 are short-lived temporary values, so semi-pruned SSA form gives them no phi-functions.

[Figure: the SSA form drawn as a graph, with a node for each definition (the r_s and r_i cycles, r_1 through r_5, the constants, and the cbr) and edges running from each use to its definition.]

Operator Strength Reduction

The SSA form as a graph:
- Each IV is an SCC, but not every SCC is an IV
- x ∈ RC if x is a constant or its definition dominates the SCC

SSA simplifies OSR:
- Find the IVs with an SCC finder
- Test the operations in each SCC
- Constant-time test for RC (constant, or check DOM)
- Without SSA, this would take several passes

Operator Strength Reduction

Finding SCCs:
- Use Tarjan's algorithm: a well-understood method that takes O(N + E) time
- Useful property: an SCC is popped only after all of its external operands have been popped
- Reduce the SCCs as they are popped:
  - |SCC| > 1 -> if it is an IV, mark it
  - |SCC| = 1 -> try to reduce it

    DFS(n)
        n.DFSnum <- nextDFSnum++
        n.visited <- true
        n.low <- n.DFSnum
        push(n)
        for each o ∈ { operands of n }
            if o.visited = false then
                DFS(o)
                n.low <- min(n.low, o.low)
            if o.DFSnum < n.DFSnum and o ∈ stack then
                n.low <- min(n.low, o.DFSnum)
        if n.low = n.DFSnum then
            SCC <- { }
            repeat
                x <- pop()
                SCC <- SCC ∪ { x }
            until x = n
            Process(SCC)

We only need to add one line to Tarjan's algorithm: the call to Process(SCC).
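The slide's DFS can be sketched in Python. This version (names are ours) represents the SSA graph as a dict from each node to its operands, and the list append stands in for the Process(SCC) hook; SCCs are emitted in the pop order the algorithm relies on:

```python
def tarjan_sccs(graph):
    """Tarjan's algorithm; `graph` maps node -> iterable of operand nodes.
    Returns SCCs in pop order: each SCC appears only after the SCCs its
    external operands belong to -- the property OSR exploits."""
    index, low = {}, {}
    stack, on_stack = [], set()
    sccs = []          # stands in for the slides' Process(SCC) hook
    counter = [0]

    def dfs(n):
        index[n] = low[n] = counter[0]
        counter[0] += 1
        stack.append(n)
        on_stack.add(n)
        for o in graph.get(n, ()):
            if o not in index:
                dfs(o)
                low[n] = min(low[n], low[o])
            elif o in on_stack:
                low[n] = min(low[n], index[o])
        if low[n] == index[n]:
            scc = set()
            while True:
                x = stack.pop()
                on_stack.discard(x)
                scc.add(x)
                if x == n:
                    break
            sccs.append(scc)   # Process(scc) would run here

    for n in list(graph):
        if n not in index:
            dfs(n)
    return sccs

# The r_i cycle from the example: r_i1 = phi(r_i0, r_i2), r_i2 = r_i1 + 1
ssa = {'r_i0': [], 'r_i1': ['r_i0', 'r_i2'], 'r_i2': ['r_i1'], 'r_1': ['r_i1']}
sccs = tarjan_sccs(ssa)
assert {'r_i1', 'r_i2'} in sccs   # the induction variable's cycle is one SCC
```

Note that {r_i0} is popped before the {r_i1, r_i2} cycle, so every external operand of the cycle has already been processed, exactly the property the bullet above states.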

Operator Strength Reduction

What should Process(r) do?
- If r is one node, try to reduce it
- If r is a collection of nodes:
  - Check to see if it is an IV
  - If so, reduce it & any operations that use it
  - If not, try to reduce the operations in r

    Process(r)
        if r has only one member, n, then
            if n has the form x <- IV * RC, x <- RC * IV, x <- IV ± RC, or x <- RC + IV then
                Replace(n, IV, RC)
            else
                n.header <- NULL
        else
            ClassifyIV(r)

Let's tackle the easier problem first: ClassifyIV().

Operator Strength Reduction

    ClassifyIV(r)
        /* find the SCC's header by CFG RPO number */
        header <- first(r)
        for each node n ∈ r
            if header.RPOnum > n.block.RPOnum then
                header <- n.block
        for each node n ∈ r
            if n.op is not one of { phi, +, -, COPY } then
                r is not an induction variable
            else
                for each o ∈ { operands of n }
                    if o ∉ r and not RCon(o, header) then
                        r is not an induction variable
        if r is an induction variable then
            for each node n ∈ r
                n.header <- header
        else
            for each node n ∈ r
                /* reduce these ops */
                if n has the form x <- IV * RC, x <- RC * IV, x <- IV ± RC, or x <- RC + IV then
                    Replace(n, IV, RC)
                else
                    n.header <- NULL

    RCon(o, header)
        if o.op is loadI then            /* constant */
            return true
        else if o.block >> header then   /* >> means "strictly dominates" */
            return true
        else
            return false

Operator Strength Reduction

    /* replace n with a COPY */
    Replace(n, iv, rc)
        result <- Reduce(n.op, iv, rc)
        replace n with a COPY from result
        n.header <- iv.header

    /* create a new IV & return its name */
    Reduce(op, iv, rc)
        result <- search(op, iv, rc)       /* returns the name of op applied to iv and rc */
        if result is not found then
            result <- a new name
            add(op, iv, rc, result)
            newDef <- copyDef(iv, result)  /* clones the definition */
            for each operand o of newDef
                if o.header = iv.header then
                    replace o with Reduce(op, o, rc)
                else if op = * or newDef.op = phi then
                    replace o with Apply(op, o, rc)
        return result

The big picture:
- Reduce() creates a new IV, with the appropriate range & increment. For r_3 in our example, the range would be @a to @a+396, with an increment of 4.
- Replace() takes a candidate operation and rewrites it with a COPY from the new IV; it uses Reduce() to create that IV.
- Net effect: replace the candidate with a COPY from some new IV that runs from @a to @a+396 & increments by 4 on each iteration.

Operator Strength Reduction

    /* insert a new operation */
    Apply(op, arg1, arg2)
        result <- search(op, arg1, arg2)
        if result is not found then
            if arg1.header ≠ NULL /* arg1 ∈ IV */ and RCon(arg2, arg1.header) then
                result <- Reduce(op, arg1, arg2)
            else if arg2.header ≠ NULL /* arg2 ∈ IV */ and RCon(arg1, arg2.header) then
                result <- Reduce(op, arg2, arg1)
            else
                result <- a new name
                add(op, arg1, arg2, result)
                choose a location to insert op
                try constant folding
                create newOper at the location
                newOper.header <- NULL
        return result

The big picture:
- Apply() takes an op & two args and inserts the corresponding operation into the code (if it isn't already there)
- It uses >> (dominance) on arg1 & arg2 to find a location
- It tries to reduce the operation, and tries to simplify it
- Net effect: replace (i - 1) * 4 + @a with a COPY from some new IV that runs from @a to @a+396 & increments by 4 on each iteration

Example

[Figure: the SSA graph after OSR. Each reduction created a new IV cycle: r_a0/r_a1/r_a2 (initial value 0, increment 1, reducing r_1), r_a3/r_a4/r_a5 (initial value 0, increment 4, reducing r_2), and r_a6/r_a7/r_a8 (initial value @a, increment 4, reducing r_3). The nodes for r_1, r_2, and r_3 have become COPYs from the new IVs.]

And, most of this is dead...

Example

[Figure: the SSA graph after dead code is removed. What remains: the sum cycle (r_s0, r_s1, r_s2 with the load and add), the original IV cycle on r_i0/r_i1/r_i2, the new IV cycle r_a6/r_a7/r_a8 (increment 4) feeding the COPY that defines r_3, and the cmp_LT/cbr pair that still tests r_i2 against 100.]

The r_i cycle is dead, except for the comparison & branch; we need to reformulate them on r_a8. The transformation that performs this simplification is called linear function test replacement.

Linear Function Test Replacement

Each time a new, reduced IV is created:
- Add an LFTR edge from the old IV to the new IV
- Label the edge with the opcode and RC of the reduction

To rewrite a test, walk the LFTR edges to accumulate the transformation, then use the accumulated transformation to rewrite the test.
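For the running example, the accumulated transformation can be sketched in Python. The edge chain below is hard-coded from the example (labels -1, *4, +@a, matching the reductions of r_1, r_2, and r_3); OSR would discover it by following the LFTR edges, and the base address `a` is an illustrative value:

```python
a = 1000  # hypothetical base address of the array a

# LFTR edges recorded as r_1, r_2, and r_3 were reduced:
#   r_i --(-1)--> new IV for r_1 --(*4)--> new IV for r_2 --(+a)--> new IV for r_3
lftr_chain = [('-', 1), ('*', 4), ('+', a)]

def rewrite_bound(bound, chain):
    """Apply each edge label in order to the old loop bound."""
    for op, rc in chain:
        if op == '-':
            bound -= rc
        elif op == '*':
            bound *= rc
        elif op == '+':
            bound += rc
    return bound

# The original test compares r_i against 100; the rewritten test compares
# the new IV against (100 - 1) * 4 + a, i.e. a + 396, the r_lim of slide 5
assert rewrite_bound(100, lftr_chain) == a + 396
```

As a sanity check, the first value of r_i (1) maps to @a, the first address the new IV takes on.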

Example

[Figure: the SSA graph with LFTR edges added from the r_i cycle through the successive new IV cycles, labeled with the reductions (-1, *4, +@a). Follow the edges to find the right IV and to accumulate the transformation; the loop bound 100 becomes (100 - 1) * 4 + @a = @a + 396.]

Example

[Figure: after LFTR, the cmp_LT and cbr test r_a8 against the rewritten bound @a + 396. Now the r_i cycle is dead, too; the r_a cycle and the sum cycle are not dead.]

Example

The resulting code:

    loadI  0             => r_sum
    loadI  @a            => r_3
    addI   r_3, 396      => r_lim
loop: load   r_3         => r_4
    addI   r_3, 4        => r_3
    add    r_4, r_sum    => r_sum
    cmp_LT r_3, r_lim    => r_5
    cbr    r_5           => loop, exit
exit: ...

And, we're done.

Complexity

What does all this cost?
- LFTR takes time proportional to the length of the LFTR edge chain that it follows
- What about OSR? Each time it reduces a cycle, it clones every node in the cycle. How bad can that get?

Worst-case Example

Before strength reduction:

    i <- 0
    while (P0)
        if (P1) then
            i <- i + 1
            k <- i * c1
        if (P2) then
            i <- i + 2
            k <- i * c2
        ...
        if (Pn) then
            i <- i + n
            k <- i * cn
    end

After strength reduction:

    i <- 0; t1 <- 0; t2 <- 0; ... ; tn <- 0
    while (P0)
        if (P1) then
            t1 <- t1 + c1; t2 <- t2 + c2; ... ; tn <- tn + cn
            i <- i + 1
            k <- t1
        if (P2) then
            t1 <- t1 + 2 * c1; t2 <- t2 + 2 * c2; ... ; tn <- tn + 2 * cn
            i <- i + 2
            k <- t2
        ...
        if (Pn) then
            t1 <- t1 + n * c1; t2 <- t2 + n * c2; ... ; tn <- tn + n * cn
            i <- i + n
            k <- tn
    end

This code requires a quadratic number of updates: each of the n conditional blocks must update all n temporaries.
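A small simulation makes the quadratic blow-up concrete; the constants and the sequence of taken branches are illustrative, and the predicates are replaced by a fixed trace:

```python
n = 4
c = [3, 5, 7, 11]        # hypothetical constants c_1 .. c_n
taken = [1, 3, 3, 2, 4]  # which predicate P_j fires on each trip (1-based)

# Before strength reduction: one add and one multiply per taken branch
i = 0
ks_before = []
for j in taken:
    i += j
    ks_before.append(i * c[j - 1])

# After strength reduction: t[m] tracks i * c_{m+1}, so every taken
# branch must update all n temporaries
i = 0
t = [0] * n
ks_after = []
updates = 0
for j in taken:
    i += j
    for m in range(n):   # n updates per branch: quadratic in n overall
        t[m] += j * c[m]
        updates += 1
    ks_after.append(t[j - 1])

assert ks_before == ks_after          # same values of k, no multiplies left
assert updates == len(taken) * n      # n updates on every taken branch
```

The point is not that the simulation is faster, but that any correct reduced form must perform exactly these updates, which is the argument the next slide makes.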

Complexity

What does all this cost?
- LFTR takes time proportional to the length of the LFTR edge chain that it follows
- What about OSR? Each time it reduces a cycle, it clones every node in the cycle. How bad can that get?
  - In the worst case, OSR must insert a number of updates that is quadratic in the size of the original code
  - Any strength reduction algorithm must insert the same set of updates if it is to reduce the computation; if it doesn't, it misses the opportunity
  - The complexity is part of the problem, not part of the solution
- OSR is as fast (asymptotically) as other algorithms, and a constant factor faster than Allen-Cocke-Kennedy