Code Optimization, Part II Regional Techniques Comp 412 Copyright 2010, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in Comp.

Slides:

Advertisements

Similar presentations

Information Security of Embedded Systems : Design of Secure Systems Prof. Dr. Holger Schlingloff Institut für Informatik und Fraunhofer FIRST.

Advertisements

Example of Constructing the DAG (1)t 1 := 4 * iStep (1):create node 4 and i 0 Step (2):create node Step (3):attach identifier t 1 (2)t 2 := a[t 1 ]Step.

Linked Lists in C and C++ CS-2303, C-Term Linked Lists in C and C++ CS-2303 System Programming Concepts (Slides include materials from The C Programming.

SSA and CPS CS153: Compilers Greg Morrisett. Monadic Form vs CFGs Consider CFG available exp. analysis: statement gen's kill's x:=v 1 p v 2 x:=v 1 p v.

8. Static Single Assignment Form Marcus Denker. © Marcus Denker SSA Roadmap  Static Single Assignment Form (SSA)  Converting to SSA Form  Examples.

School of EECS, Peking University “Advanced Compiler Techniques” (Fall 2011) SSA Guo, Yao.

Operator Strength Reduction From Cooper, Simpson, & Vick, “Operator Strength Reduction”, ACM TOPLAS, 23(5), See also § of EaC2e. 1COMP 512,

Lecture 11: Code Optimization CS 540 George Mason University.

Chapter 9 Code optimization Section 0 overview 1.Position of code optimizer 2.Purpose of code optimizer to get better efficiency –Run faster –Take less.

Course Outline Traditional Static Program Analysis –Theory Compiler Optimizations; Control Flow Graphs Data-flow Analysis – today’s class –Classic analyses.

Control-Flow Graphs & Dataflow Analysis CS153: Compilers Greg Morrisett.

CS412/413 Introduction to Compilers Radu Rugina Lecture 37: DU Chains and SSA Form 29 Apr 02.

Components of representation Control dependencies: sequencing of operations –evaluation of if & then –side-effects of statements occur in right order Data.

A Deeper Look at Data-flow Analysis Copyright 2011, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in Comp 512 at Rice University.

Program Representations. Representing programs Goals.

Code Motion of Control Structures From the paper by Cytron, Lowry, and Zadeck, COMP 512, Rice University Copyright 2011, Keith D. Cooper & Linda.

SSA-Based Constant Propagation, SCP, SCCP, & the Issue of Combining Optimizations 1COMP 512, Rice University Copyright 2011, Keith D. Cooper & Linda Torczon,

The Last Lecture Copyright 2011, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in Comp 512 at Rice University have explicit permission.

1 CS 201 Compiler Construction Lecture 7 Code Optimizations: Partial Redundancy Elimination.

Lazy Code Motion Comp 512 Spring 2011

Loop Invariant Code Motion — classical approaches — 1COMP 512, Rice University Copyright 2011, Keith D. Cooper & Linda Torczon, all rights reserved. Students.

Introduction to Code Optimization Comp 412 Copyright 2010, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at Rice.

6/9/2015© Hal Perkins & UW CSEU-1 CSE P 501 – Compilers SSA Hal Perkins Winter 2008.

Improving code generation. Better code generation requires greater context Over expressions: optimal ordering of subtrees Over basic blocks: Common subexpression.

1 Intermediate representation Goals: –encode knowledge about the program –facilitate analysis –facilitate retargeting –facilitate optimization scanning.

Data Flow Analysis Compiler Design October 5, 2004 These slides live on the Web. I obtained them from Jeff Foster and he said that he obtained.

Improving Code Generation Honors Compilers April 16 th 2002.

Introduction to Optimization Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved.

Instruction Scheduling II: Beyond Basic Blocks Comp 412 Copyright 2010, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in Comp.

Data Flow Analysis Compiler Design Nov. 8, 2005.

PSUCS322 HM 1 Languages and Compiler Design II IR Code Optimization Material provided by Prof. Jingke Li Stolen with pride and modified by Herb Mayer PSU.

Global Common Subexpression Elimination with Data-flow Analysis Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved. Students.

Code Optimization, Part III Global Methods Comp 412 Copyright 2010, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in Comp 412.

Structural Data-flow Analysis Algorithms: Allen-Cocke Interval Analysis Copyright 2011, Keith D. Cooper & Linda Torczon, all rights reserved. Students.

Introduction to Optimization, II Value Numbering & Larger Scopes Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved. Students.

First Principles (with examples from value numbering) C OMP 512 Rice University Houston, Texas Fall 2003 Copyright 2003, Keith D. Cooper & Linda Torczon,

1 June 4, June 4, 2016June 4, 2016June 4, 2016 Azusa, CA Sheldon X. Liang Ph. D. Azusa Pacific University, Azusa, CA 91702, Tel: (800)

Building SSA Form, III 1COMP 512, Rice University This lecture presents the problems inherent in out- of-SSA translation and some ways to solve them. Copyright.

12/5/2002© 2002 Hal Perkins & UW CSER-1 CSE 582 – Compilers Data-flow Analysis Hal Perkins Autumn 2002.

Global Redundancy Elimination: Computing Available Expressions Copyright 2011, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled.

Cleaning up the CFG Eliminating useless nodes & edges C OMP 512 Rice University Houston, Texas Fall 2003 Copyright 2003, Keith D. Cooper & Linda Torczon,

11/23/2015© Hal Perkins & UW CSEQ-1 CSE P 501 – Compilers Introduction to Optimization Hal Perkins Autumn 2009.

Algebraic Reassociation of Expressions Briggs & Cooper, “Effective Partial Redundancy Elimination,” Proceedings of the ACM SIGPLAN 1994 Conference on Programming.

Terminology, Principles, and Concerns, III With examples from DOM (Ch 9) and DVNT (Ch 10) Copyright 2011, Keith D. Cooper & Linda Torczon, all rights reserved.

Dead Code Elimination This lecture presents the algorithm Dead from EaC2e, Chapter 10. That algorithm derives, in turn, from Rob Shillner’s unpublished.

Terminology, Principles, and Concerns, II With examples from superlocal value numbering (Ch 8 in EaC2e) Copyright 2011, Keith D. Cooper & Linda Torczon,

Cleaning up the CFG Eliminating useless nodes & edges This lecture describes the algorithm Clean, presented in Chapter 10 of EaC2e. The algorithm is due.

Profile-Guided Code Positioning See paper of the same name by Karl Pettis & Robert C. Hansen in PLDI 90, SIGPLAN Notices 25(6), pages 16–27 Copyright 2011,

Building SSA Form, I 1COMP 512, Rice University Copyright 2011, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in Comp 512 at.

U NIVERSITY OF D ELAWARE C OMPUTER & I NFORMATION S CIENCES D EPARTMENT Optimizing Compilers CISC 673 Spring 2011 Data flow analysis John Cavazos University.

Definition-Use Chains

Introduction to Optimization

Finding Global Redundancies with Hopcroft’s DFA Minimization Algorithm

Introduction to Optimization

Wrapping Up Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at Rice University have explicit.

Building SSA Form COMP 512 Rice University Houston, Texas Fall 2003

Introduction to Optimization Hal Perkins Summer 2004

Optimization through Redundancy Elimination: Value Numbering at Different Scopes COMP 512 Rice University Houston, Texas Fall 2003 Copyright 2003, Keith.

Copyright 2003, Keith D. Cooper & Linda Torczon, all rights reserved.

Optimizations using SSA

Introduction to Optimization

Reference These slides, with minor modification and some deletion, come from U. of Delaware – and the web, of course. 4/4/2019 CPEG421-05S/Topic5.

Copyright 2003, Keith D. Cooper & Linda Torczon, all rights reserved.

Algebraic Reassociation of Expressions COMP 512 Rice University Houston, Texas Fall 2003 P. Briggs & K.D. Cooper, “Effective Partial Redundancy Elimination,”

Reference These slides, with minor modification and some deletion, come from U. of Delaware – and the web, of course. 4/17/2019 CPEG421-05S/Topic5.

Compiler Construction

The Partitioning Algorithm for Detecting Congruent Expressions COMP 512 Rice University Houston, Texas Fall 2003 Copyright 2003, Keith D. Cooper.

CSE P 501 – Compilers SSA Hal Perkins Autumn /31/2019

Optimizing Compilers CISC 673 Spring 2009 Data flow analysis

Presentation transcript:

Code Optimization, Part II Regional Techniques Comp 412 Copyright 2010, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at Rice University have explicit permission to make copies of these materials for their personal use. Faculty from other educational institutions may use these materials for nonprofit educational purposes, provided this copyright notice is preserved. COMP 412 FALL 2010

Last Lecture Introduced concept of a redundant expression An expression, x+y, is redundant at point p if, along each path from the procedure’s entry point to p, x+y has already been evaluated and neither x nor y has been redefined. If x+y is redundant at p, we can save the results of those earlier evaluations and reuse them at p, avoiding evaluation In a single block, we need only consider one such path We developed an algorithm for redundancy elimination in a single basic block Two pieces to the problem —Proving that x+y is redundant —Rewriting the code to eliminate the redundant evaluation Value numbering does both for straightline code Comp 412, Fall 20101

Local Value Numbering (Recap) The LVN Algorithm, with bells & whistles Comp 412, Fall for i ← 0 to n-1 1. get the value numbers V 1 and V 2 for L i and R i 2. if L i and R i are both constant then evaluate Li Op i R i, assign it to T i, and mark T i as a constant 3. if Li Op i R i matches an identity then replace it with a copy operation or an assignment 4. if Op i commutes and V 1 > V 2 then swap V 1 and V 2 5. construct a hash key 6. if the hash key is already present in the table then replace operation I with a copy into T i and mark T i with the VN else insert a new VN into table for hash key & mark T i with the VN for i ← 0 to n-1 1. get the value numbers V 1 and V 2 for L i and R i 2. if L i and R i are both constant then evaluate Li Op i R i, assign it to T i, and mark T i as a constant 3. if Li Op i R i matches an identity then replace it with a copy operation or an assignment 4. if Op i commutes and V 1 > V 2 then swap V 1 and V 2 5. construct a hash key 6. if the hash key is already present in the table then replace operation I with a copy into T i and mark T i with the VN else insert a new VN into table for hash key & mark T i with the VN Block is a sequence of n operations of the form T i ← L i Op i R i Block is a sequence of n operations of the form T i ← L i Op i R i Constant folding Algebraic identities Commutativity

Comp 412, Fall Missed opportunities (need stronger methods) m  a + b n  a + b A p  c + d r  c + d B y  a + b z  c + d G q  a + b r  c + d C e  b + 18 s  a + b u  e + f D e  a + 17 t  c + d u  e + f E v  a + b w  c + d x  e + f F Local Value Numbering  1 block at a time  Strong local results  No cross-block effects Local Value Numbering  1 block at a time  Strong local results  No cross-block effects LVN finds these redundant ops

Comp 412, Fall Terminology m  a + b n  a + b A p  c + d r  c + d B y  a + b z  c + d G q  a + b r  c + d C e  b + 18 s  a + b u  e + f D e  a + 17 t  c + d u  e + f E v  a + b w  c + d x  e + f F Control-flow graph (CFG)  Nodes for basic blocks  Edges for branches  Basis for much of program analysis & transformation This CFG, G = (N,E)  N = {A,B,C,D,E,F,G}  E = {(A,B),(A,C),(B,G),(C,D), (C,E),(D,F),(E,F),(F,E)}  |N| = 7, |E| = 8

Scope of Optimization In scanning and parsing, “scope” refers to a region of the code that corresponds to a distinct name space. In optimization “scope” refers to a region of the code that is subject to analysis and transformation. Notions are somewhat related Connection is not necessarily intuitive Different scopes introduces different challenges & different opportunities Historically, optimization has been performed at several distinct scopes. Comp 412, Fall 20105

Scope of Optimization Local optimization Operates entirely within a single basic block Properties of block lead to strong optimizations Regional optimization Operate on a region in the CFG that contains multiple blocks Loops, trees, paths, extended basic blocks Whole procedure optimization ( intraprocedural ) Operate on entire CFG for a procedure Presence of cyclic paths forces analysis then transformation Whole program optimization ( interprocedural ) Operate on some or all of the call graph ( multiple procedures ) Must contend with call/return & parameter binding Comp 412, Fall A basic block is a maximal length sequence of straightline code.

A Comp 412 Fairy Tale We would like to believe optimization developed in an orderly fashion Local methods led to regional methods Regional methods led to global methods Global methods led to interprocedural methods It did not happen that way First compiler, FORTRAN, used both local & global methods Development has been scattershot & concurrent Scope appears to relate to the inefficiency being attacked, rather than the refinement of the inventor. Comp 412, Fall 20107

8 Superlocal Value Numbering m  a + b n  a + b A p  c + d r  c + d B y  a + b z  c + d G q  a + b r  c + d C e  b + 18 s  a + b u  e + f D e  a + 17 t  c + d u  e + f E v  a + b w  c + d x  e + f F * EBB: A maximal set of blocks B 1, B 2, …, B n where each B i, except B 1, has only exactly one predecessor and that block is in the EBB. {A,B,C,D,E} is an EBB It has 3 paths: (A,B), (A,C,D), & (A,C,E) Can sometimes treat each path as if it were a block {F} & {G} are degenerate EBBs Superlocal: “applied to an EBB” {A,B,C,D,E} is an EBB It has 3 paths: (A,B), (A,C,D), & (A,C,E) Can sometimes treat each path as if it were a block {F} & {G} are degenerate EBBs Superlocal: “applied to an EBB” A Regional Technique

Comp 412, Fall Superlocal Value Numbering m  a + b n  a + b A p  c + d r  c + d B y  a + b z  c + d G q  a + b r  c + d C e  b + 18 s  a + b u  e + f D e  a + 17 t  c + d u  e + f E v  a + b w  c + d x  e + f F The Concept  Apply local method to paths through the EBBs  Do { A,B }, { A,C,D }, & { A,C,E }  Obtain reuse from ancestors  Avoid re-analyzing A & C  Does not help with F or G The Concept  Apply local method to paths through the EBBs  Do { A,B }, { A,C,D }, & { A,C,E }  Obtain reuse from ancestors  Avoid re-analyzing A & C  Does not help with F or G * EBB: A maximal set of blocks B 1, B 2, …, B n where each B i, except B 1, has only exactly one predecessor and that block is in the EBB.

Comp 412, Fall Superlocal Value Numbering Efficiency Use A’s table to initialize tables for B & C To avoid duplication, use a scoped hash table —A, AB, A, AC, ACD, AC, ACE, F, G Need a VN  name mapping to handle kills —Must restore map with scope —Adds complication, not cost m  a + b n  a + b A p  c + d r  c + d B y  a + b z  c + d G q  a + b r  c + d C e  b + 18 s  a + b u  e + f D e  a + 17 t  c + d u  e + f E v  a + b w  c + d x  e + f F “kill” is a re- definition of some name

Comp 412, Fall Superlocal Value Numbering Efficiency Use A’s table to initialize tables for B & C To avoid duplication, use a scoped hash table —A, AB, A, AC, ACD, AC, ACE, F, G Need a VN  name mapping to handle kills —Must restore map with scope —Adds complication, not cost To simplify matters Need unique name for each definition Makes name  VN Use the SSA name space m  a + b n  a + b A p  c + d r  c + d B y  a + b z  c + d G q  a + b r  c + d C e  b + 18 s  a + b u  e + f D e  a + 17 t  c + d u  e + f E v  a + b w  c + d x  e + f F The subscripted names from the earlier example are an instance of the SSA name space. “kill” is a re- definition of some name

Comp 412, Fall SSA Name Space (locally) Example ( from earlier ): With VNs a 0 3  x y 0 2  b 0 3  x y 0 2 a 1 4  17  c 0 3  x y 0 2 Notation: While complex, the meaning is clear Notation: While complex, the meaning is clear Original Code a 0  x 0 + y 0  b 0  x 0 + y 0 a 1  17  c 0  x 0 + y 0 Renaming: Give each value a unique name Makes it clear Renaming: Give each value a unique name Makes it clear Rewritten a 0 3  x y 0 2  b 0 3  a 0 3 a 1 4  17  c 0 3  a 0 3 Result: a 0 3 is available Rewriting just works Result: a 0 3 is available Rewriting just works

Comp 412, Fall SSA Name Space ( in general ) Two principles Each name is defined by exactly one operation Each operand refers to exactly one definition To reconcile these principles with real code Insert  -functions at merge points to reconcile name space Add subscripts to variable names for uniqueness x   x +... x 0 ... x 1 ... x 2  ( x 0, x 1 )  x becomes

Comp 412, Fall Superlocal Value Numbering m 0  a + b n 0  a + b A p 0  c + d r 0  c + d B r 2   (r 0,r 1 ) y 0  a + b z 0  c + d G q 0  a + b r 1  c + d C e 0  b + 18 s 0  a + b u 0  e + f D e 1  a + 17 t 0  c + d u 1  e + f E e 3   (e 0,e 1 ) u 2   (u 0,u 1 ) v 0  a + b w 0  c + d x 0  e + f F Our example in SSA form Φ -functions at join points for names that need them Minimal set of Φ -functions (see Chapter 9 in EaC) Our example in SSA form Φ -functions at join points for names that need them Minimal set of Φ -functions (see Chapter 9 in EaC)

Superlocal Value Numbering The SVN Algorithm Comp 412, Fall WorkList ← { entry block } Empty ← new table while (WorkList is not empty) remove a block b from WorkList SVN(b, Empty) SVN( Block, Table) t ← new table for Block, with Table linked as surrounding scope LVN( Block, t) for each successor s of Block if s has just 1 predecessor then SVN( s, t ) else if s has not been processed then add s to WorkList deallocate t WorkList ← { entry block } Empty ← new table while (WorkList is not empty) remove a block b from WorkList SVN(b, Empty) SVN( Block, Table) t ← new table for Block, with Table linked as surrounding scope LVN( Block, t) for each successor s of Block if s has just 1 predecessor then SVN( s, t ) else if s has not been processed then add s to WorkList deallocate t Table for base case Blocks to process Use LVN for the work In the same EBB Starts a new EBB Assumes LVN has been parameterized around block and table

Comp 412, Fall Superlocal Value Numbering m 0  a + b n 0  a + b A p 0  c + d r 0  c + d B r 2   (r 0,r 1 ) y 0  a + b z 0  c + d G q 0  a + b r 1  c + d C e 0  b + 18 s 0  a + b u 0  e + f D e 1  a + 17 t 0  c + d u 1  e + f E e 3   (e 0,e 1 ) u 2   (u 0,u 1 ) v 0  a + b w 0  c + d x 0  e + f F With all the bells & whistles  Find more redundancy  Pay minimal extra cost  Still does nothing for F & G With all the bells & whistles  Find more redundancy  Pay minimal extra cost  Still does nothing for F & G Superlocal techniques  Some local methods extend cleanly to superlocal scopes  VN does not back up  Backward motion causes probs Superlocal techniques  Some local methods extend cleanly to superlocal scopes  VN does not back up  Backward motion causes probs

Loop Unrolling Applications spend a lot of time in loops We can reduce loop overhead by unrolling the loop Eliminated additions, tests, and branches —Can subject resulting code to strong local optimization! Only works with fixed loop bounds & few iterations The principle, however, is sound Unrolling is always safe, as long as we get the bounds right Comp 412, Fall A Regional Technique do i = 1 to 100 by 1 a(i) ← b(i) * c(i) end a(1) ← b(1) * c(1) a(2) ← b(2) * c(2) a(2) ← b(3) * c(3) … a(100) ← b(100) * c(100) Complete unrolling

Loop Unrolling Unrolling by smaller factors can achieve much of the benefit Example: unroll by 4 Achieves much of the savings with lower code growth Reduces tests & branches by 25% LVN will eliminate duplicate adds and redundant expressions Less overhead per useful operation But, it relied on knowledge of the loop bounds… Comp 412, Fall do i = 1 to 100 by 1 a(i) ← b(i) * c(i) end do i = 1 to 100 by 4 a(i) ← b(i) * c(i) a(i+1) ← b(i+1) * c(i+1) a(i+2) ← b(i+2) * c(i+2) a(i+3) ← b(i+3) * c(i+3) end Unroll by 4

Loop Unrolling Unrolling with unknown bounds Need to generate guard loops Achieves most of the savings Reduces tests & branches by 25% LVN still works on loop body Guard loop takes some space Can generalize to arbitrary upper & lower bounds, unroll factors Comp 412, Fall do i = 1 to n by 1 a(i) ← b(i) * c(i) end i ← 1 do while (i+3 < n ) a(i) ← b(i) * c(i) a(i+1) ← b(i+1) * c(i+1) a(i+2) ← b(i+2) * c(i+2) a(i+3) ← b(i+3) * c(i+3) i ←i + 4 end do while (i < n) a(i) ← b(i) * c(i) i ← i + 1 end Unroll by 4

Loop Unrolling One other unrolling trick Eliminate copies at the end of a loop Unroll by LCM of copy-cycle lengths Eliminates the copies, which were a naming artifact Achieves some of the benefits of unrolling —Lower overhead, longer blocks for local optimization Situation occurs in more cases than you might suspect Comp 412, Fall t1 ← b(0) do i = 1 to 100 by 1 t2 ← b(i) a(i) ← a(i) + t1 + t2 t1 ← t2 end Unroll and rename t1 ← b(0) do i = 1 to 100 by 2 t2 ← b(i) a(i) ← a(i) + t1 + t2 t1 ← b(i+1) a(i+1) ← a(i+1) + t2 + t1 end This result has been rediscovered many times. [Kennedy’s thesis]