Terminology, Principles, and Concerns, IV. With examples from LIVE and global block positioning. Copyright 2011, Keith D. Cooper & Linda Torczon, all rights reserved.

Terminology, Principles, and Concerns, IV With examples from LIVE and global block positioning Copyright 2011, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in Comp 512 at Rice University have explicit permission to make copies of these materials for their personal use. Faculty from other educational institutions may use these materials for nonprofit educational purposes, provided this copyright notice is preserved. Comp 512 Spring 2011

Last Lecture

Computing dominance information
- Quick introduction to global data-flow analysis
  - That is, compile-time reasoning about the runtime flow of values
- Round-robin iterative algorithm to find the MOP solution to DOM

Using immediate dominators to improve on SVN
- For a node n, start LVN with the hash table from IDOM(n)
  - Includes results from each predecessor in the dominator tree
- Use a scoped hash table and SSA names to simplify the algorithm
- Some predecessor information at each node in the CFG

This Lecture

Examples of global analysis and transformation:

Computing live variables
- Classic backward global data-flow problem
- Used in SSA construction and in register allocation

Using live information to eliminate useless stores
- Simple demonstration of the use of LIVE

Single-procedure block placement algorithm (Pettis & Hansen)
- Arrange the blocks to maximize fall-through branches
- Improves code locality as a natural consequence

Computing Live Information

A value v is live at p if there exists a path from p to some use of v along which v is not re-defined.

Data-flow problems are expressed as simultaneous equations. Annotate each block with the sets LIVEOUT and LIVEIN:

    LIVEOUT(b)  = ∪ { LIVEIN(s) : s ∈ succ(b) }
    LIVEIN(b)   = UEVAR(b) ∪ (LIVEOUT(b) − VARKILL(b))
    LIVEOUT(nf) = ∅

where
- UEVAR(b) is the set of names used in block b before being defined in b,
- VARKILL(b) is the set of names defined in b, and
- nf is the exit node of the CFG.

The domain of LIVEOUT is variables (see EaC). Note that LIVE is a backward data-flow problem.

Computing Live Information

The compiler can solve these equations with a simple algorithm: the world's quickest introduction to data-flow analysis!

The worklist iterative algorithm:

    WorkList ← { all blocks }
    while (WorkList ≠ ∅)
        remove a block b from WorkList
        compute LIVEOUT(b)
        compute LIVEIN(b)
        if LIVEIN(b) changed
            then add pred(b) to WorkList

Why does this work?
- LIVEOUT, LIVEIN ∈ 2^Names
- UEVAR & VARKILL are constants for each b
- The equations are monotone
- There can be only a finite number of additions to the sets, so the algorithm will reach a fixed point!

The speed of convergence depends on the order in which blocks are removed and their sets recomputed. The worklist should be implemented as a set so that it does not contain duplicate entries.
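For concreteness, here is a minimal Python sketch of this worklist solver. The CFG representation (dicts of successor and predecessor lists keyed by block name) and the precomputed UEVAR/VARKILL maps are assumptions made for the example, not part of the original slides.

    # A minimal sketch of the worklist algorithm for LIVE analysis.
    # Assumed representation: blocks are names; succ, pred, uevar, and
    # varkill are dicts keyed by block name.
    def solve_live(blocks, succ, pred, uevar, varkill):
        live_in = {b: set() for b in blocks}
        live_out = {b: set() for b in blocks}
        worklist = set(blocks)            # a set, so no duplicate entries
        while worklist:
            b = worklist.pop()
            new_out = set()               # LIVEOUT(b) = union of LIVEIN(s)
            for s in succ[b]:
                new_out |= live_in[s]
            live_out[b] = new_out
            new_in = uevar[b] | (new_out - varkill[b])
            if new_in != live_in[b]:      # a change propagates backward
                live_in[b] = new_in
                worklist.update(pred[b])  # re-examine b's predecessors
        return live_in, live_out

Note that the exit node has no successors, so its LIVEOUT stays ∅, exactly as the boundary equation requires.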

Using Live Information: Eliminating Unneeded Stores

Transformation: eliminating unneeded stores.
- If a value is in a register, we have seen its last definition, and it is never again used, then the store is dead (except for debugging).
- The compiler can eliminate the store.

The plan (the walk is sketched in code below):
1. Solve for LIVEIN and LIVEOUT.
2. Walk through each block, bottom to top:
   - Compute local LIVE incrementally.
   - If the target of a STORE operation is not in LIVE, delete the STORE.

If all STOREs to a local variable are eliminated, the compiler can delete the space for it from the activation record. (Each LIVEOUT set draws from the set of variables, so its size is at most |variables|.)
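The bottom-to-top walk is easy to state directly in code. Below is a hedged Python sketch; the instruction shape (an op with opcode, defs, and uses fields) is an assumption made for the example, not the book's IR.

    # Sketch: scan one block bottom to top, maintaining the running LIVE
    # set, and drop any STORE whose target is not live at that point.
    # Assumed op shape: op.opcode, op.defs (names written), op.uses
    # (names read).
    def eliminate_dead_stores(block, live_out):
        live = set(live_out)              # start from LIVEOUT of the block
        kept = []
        for op in reversed(block.ops):
            if op.opcode == "store" and not (set(op.defs) & live):
                continue                  # value never read again: delete
            kept.append(op)
            live -= set(op.defs)          # a definition ends liveness above it
            live |= set(op.uses)          # uses are live above this point
        block.ops = kept[::-1]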

Using LIVE Information: Eliminating Unneeded Stores

Safety
- If x ∉ LIVE(s) at some STORE s, its value is not used along any path from s to the exit node of the CFG.
  - Its value is not read and is, therefore, dead.
- Relies on the correctness of LIVE.

Profitability
- Assumes that not executing a STORE costs less than executing it.

Opportunity
- Linear search, block by block, for STORE operations.
  - Could build a list of them while computing the initial UEVAR sets.

Block Placement

The order of blocks in memory matters:
- Bad placement can increase working-set size (TLB & page misses).
- Fall-through and branch-taken paths differ in cost & locality.

The plan:
- Discover which paths execute frequently.
- Rearrange blocks to keep those paths in contiguous memory.

Finding hot paths requires execution profile information.

Block Placement

Targets branches with unequal execution frequencies:
- Make the likely case the "fall-through" case.
- Move the unlikely case out of line & out of sight.

Potential benefits:
- Longer branch-free code sequences
- More executed operations per cache line
- Denser instruction stream ⇒ fewer cache misses
- Moving unlikely code ⇒ denser page use & fewer page faults

Block Placement

Moving infrequently executed code:

[Figure: a CFG with B1 branching to B2 (unlikely) and B3 (likely), both reaching B4. Laid out as B1, B2, B3, B4, the unlikely path gets the fall-through (cheap) case and the likely path gets an extra branch. We would like this to become the layout B1, B3, B4, with B2 placed at a long distance, in another page. Then the branch from B3 to B4 goes away, and the instruction stream is denser.]

Block Placement

Overview:
1. Build chains of frequently executed paths.
   - Work from profile data; edge profiles are better than node profiles.
   - Combine blocks with a simple greedy algorithm.
2. Lay out the code so that chains follow short forward branches.

Gathering profile data:
- Instrument the executable
- Statistical sampling
- Infer edge counts from performance-counter data

While precision is desirable, a good approximation will probably work well.

Block Placement

The idea: form chains that should be placed as straight-line code.

First step: build the chains (a sketch in code follows).
1. Make each block a degenerate chain & set its priority to the number of blocks.
2. P ← 1
3. For each edge e = ⟨x,y⟩ in the CFG, in order of decreasing frequency:
   - if x is the tail of a chain a and y is the head of a chain b, then merge a and b
   - else set priority(y) to min(priority(y), P++)

The point is to place targets after their sources, to make branches forward. (PA-RISC predicted most forward branches as taken, backward branches as not taken.) Check this code against Chapter 8.
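Under the slide's own caveat to check the code against Chapter 8, here is a hedged Python sketch of the chain-building step. The edge list of (frequency, x, y) tuples and block-name keys are assumptions made for the example.

    # Sketch of Pettis & Hansen chain building. Assumed inputs: blocks is
    # a list of block names; edges is a list of (freq, x, y) tuples.
    def build_chains(blocks, edges):
        chain_of = {b: [b] for b in blocks}          # degenerate chains
        priority = {b: len(blocks) for b in blocks}
        p = 1
        for _, x, y in sorted(edges, reverse=True):  # decreasing frequency
            a, b = chain_of[x], chain_of[y]
            if a is not b and a[-1] == x and b[0] == y:
                a.extend(b)                          # y's chain now follows x's
                for blk in b:
                    chain_of[blk] = a
            else:
                priority[y] = min(priority[y], p)
                p += 1
        return chain_of, priority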

Block Placement

Second step: lay out the code (a sketch in code follows).

    WorkList ← the chain containing the entry node n0
    while (WorkList ≠ ∅)
        pick the chain c with the lowest priority(c) from WorkList
        place it next in the code
        for each edge ⟨c,z⟩ leaving c, add z's chain to WorkList

Intuitions:
- Entry node first.
- Tries to make the edge from chain i to chain j a forward branch.
  - Predicted as taken on the target machine.
  - The edge remains only if it is the lower-probability choice.

(Replace with the corrected algorithm from Chapter 8.)
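A hedged Python sketch of the layout pass follows, again to be checked against the book's corrected version. chain_priority and chain_succs are hypothetical helpers: the first maps a chain to its priority, the second yields the chains reached by edges leaving a chain.

    # Sketch of the layout pass: place the entry chain first, then
    # repeatedly place the lowest-priority chain on the worklist.
    def lay_out(entry_chain, chain_priority, chain_succs):
        order, placed, worklist = [], set(), [entry_chain]
        while worklist:
            worklist.sort(key=chain_priority)    # lowest priority first
            c = worklist.pop(0)
            if id(c) in placed:                  # chains are lists: key by id
                continue
            placed.add(id(c))
            order.extend(c)                      # append c's blocks to layout
            for z in chain_succs(c):
                if id(z) not in placed:
                    worklist.append(z)
        return order                             # final block order in memory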

Going Further: Procedure Splitting

Any code that has a profile count of zero is "fluff". Move fluff into the distance:
- It rarely executes.
- Get more useful operations into the I-cache.
- Increase the effective density of the I-cache.
- Cost: slower execution for rarely executed code.

Implementation:
- Create a linkage-less procedure with an invented name.
- Give it a priority that the linker will sort to the code's end.
- Replace the original branch with a 0-profile branch to a 0-profile call.
  - Causes linkage code to move to the end of the procedure to maintain density.

The branch to fluff becomes a short branch to a long branch; the block with the long branch gets sorted to the end of the current procedure.

Block Placement

Safety
- Changes the position of code, not the values it computes.
- Barring bugs in the implementation, it should be safe.

Profitability
- More fall-through branches.
- Where possible, more compiler-predicted branches.
- Better code locality.

Opportunity
- Profile data shows the high-frequency edges.
- Looks at all blocks and edges in the transformation: O(N+E).
  - Many transformations have an O(N+E) component.

Transformations We Have Seen

    Scope            Name              Analysis         Effect
    Local            LVN               Incremental      Redundancy, constants, & identities
                     Balancing         LIVE info.       Enhance ILP
    Regional         Superlocal VN     CFG, EBBs        Redundancy, constants, & identities
                     Dominator VN      CFG, DOM info.   Redundancy, constants, & identities
    Global           Dead store elim.  LIVE info.       Eliminate dead stores
                     Block placement   CFG, profiles    Code locality & branch straightening
    Interprocedural  Inline subs'n,    (on Monday)
                     proc. placement

End of Lecture. Extra slides begin here.

Iterative Data-flow Analysis

Termination

If every f_n ∈ F is monotone, i.e., x ≤ y ⇒ f(x) ≤ f(y), and
if the lattice is bounded, i.e., every descending chain is finite
- (a chain is a sequence x1, x2, …, xn with xi ∈ L, 1 ≤ i ≤ n; it is descending if xi > xi+1 for 1 ≤ i < n),

then
- the set at each node can change only a finite number of times, and
- the iterative algorithm must halt on an instance of the problem.

Finite lattice, bounded descending chains, & monotone functions ⇒ termination.

Any finite semilattice is bounded. Some infinite semilattices are also bounded.

[Figure: the semilattice of real constants (…, 0.001, 0.002, …): infinitely many values, but every descending chain is finite.]

Iterative Data-flow Analysis

Correctness (what does it compute?)

If every f_n ∈ F is monotone, i.e., x ≤ y ⇒ f(x) ≤ f(y), and if the semilattice is bounded, i.e., every descending chain is finite (a chain x1, x2, …, xn with xi ∈ L is descending if xi > xi+1 for 1 ≤ i < n), then:

Given a bounded semilattice S and a monotone function space F,
- ∃ k such that f^k(⊥) = f^j(⊥) for all j > k; f^k(⊥) is called the least fixed point of f over S.
- If L has a ⊤, then ∃ k such that f^k(⊤) = f^j(⊤) for all j > k; f^k(⊤) is called the maximal fixed point of f over S. (Starting from ⊤ is the "optimistic" formulation.)

Here f^k(x) denotes the application of f to x k times.

Iterative Data-flow Analysis

Correctness

If every f_n ∈ F is monotone, i.e., f(x ∧ y) ≤ f(x) ∧ f(y), and if the lattice is bounded, i.e., every descending chain is finite (a chain x1, x2, …, xn with xi ∈ L is descending if xi > xi+1),

then the round-robin algorithm computes a least fixed point (LFP).

The uniqueness of the solution depends on other properties of F:
- Unique solution ⇒ it finds the one we want.
- Multiple solutions ⇒ we need to know which one it finds.

Iterative Data-flow Analysis

Correctness: does the iterative algorithm compute the desired answer?

Admissible function spaces:
1. ∀ f ∈ F, ∀ x, y ∈ L: f(x ∧ y) = f(x) ∧ f(y)
2. ∃ f_i ∈ F such that ∀ x ∈ L: f_i(x) = x
3. f, g ∈ F ⇒ ∃ h ∈ F such that h(x) = f(g(x))
4. ∀ x ∈ L, ∃ a finite subset H ⊆ F such that x = ∧ { f(⊥) : f ∈ H }

If F meets these four conditions, then an instance of the problem has a unique fixed-point solution (an instance = graph + initial values):
- LFP = MFP = MOP
- The order of evaluation does not matter.

If F is not distributive, the fixed-point solution may not be unique.

Iterative Data-flow Analysis

If a data-flow framework meets those admissibility conditions, then it has a unique fixed-point solution:
- The iterative algorithm finds the (best) answer.
- The solution does not depend on the order of computation.
- The algorithm can choose an order that converges quickly.

Intuition: choose an order so that changes propagate as far as possible on each "sweep".
- Process a node's predecessors before the node.
- Cycles pose problems, of course. Ignore back edges when computing the order?

Ordering the Nodes to Maximize Propagation

[Figure: a CFG shown twice, once numbered in postorder and once in reverse postorder; a node's RPO number is (N+1) minus its postorder number.]

- Reverse postorder visits a node's predecessors before visiting the node.
- Use reverse preorder for backward problems.
  - Reverse postorder on the reverse CFG is reverse preorder.
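For concreteness, here is a small Python sketch of the ordering, assuming the usual successor-map CFG representation; for a backward problem such as LIVE, run the same DFS over the reverse CFG.

    # Sketch: compute reverse postorder by DFS. succ maps each node to
    # its successor list. A node receives its postorder number only
    # after all of its successors have been numbered.
    def reverse_postorder(entry, succ):
        seen, post = set(), []
        def dfs(n):
            seen.add(n)
            for s in succ[n]:
                if s not in seen:
                    dfs(s)
            post.append(n)
        dfs(entry)
        return post[::-1]     # RPO number = (N+1) - postorder number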

Iterative Data-flow Analysis

Speed

For a problem with an admissible function space and a bounded semilattice: if the functions all meet the rapid condition, i.e.,

    ∀ f, g ∈ F, ∀ x ∈ L: f(g(⊥)) ≥ g(⊥) ∧ f(x) ∧ x

then a round-robin, reverse-postorder iterative algorithm will halt in d(G)+3 passes over a graph G.

d(G) is the loop-connectedness of the graph with respect to a DFST:
- the maximal number of back edges in an acyclic path.
- Several studies suggest that, in practice, d(G) is small (< 3).
- For most CFGs, d(G) is independent of the specific DFST.

Sets stabilize in two passes around a loop. Each pass does O(E) meets & O(N) other operations.

Iterative Data-flow Analysis

What does this mean?
- Reverse postorder: an easily computed order that increases propagation per pass.
- Round-robin iterative algorithm: visit all the nodes in a consistent order (RPO); repeat until the sets stop changing.
- Rapid condition: most classic global data-flow problems meet this condition.

These conditions are easily met. Admissible framework, rapid function space, and a round-robin, reverse-postorder iterative algorithm ⇒ the analysis runs in (effectively) linear time.

Some Problems Are Not Admissible

Example: global constant propagation.

The first admissibility condition is distributivity: ∀ f ∈ F, ∀ x, y ∈ L, f(x ∧ y) = f(x) ∧ f(y). Constant propagation is not admissible:
- The Kam & Ullman time bound does not hold.
- There are tight time bounds, however, based on lattice height.
- It requires a variable-by-variable formulation.

Consider a block containing a ← b + c, whose effects are modeled by a function f, reached along two paths carrying the states S1 : {b=3, c=4} and S2 : {b=1, c=6}:

    f(S1) = {a=7, b=3, c=4}
    f(S2) = {a=7, b=1, c=6}
    f(S1 ∧ S2) = f(∅) = ∅, yet f(S1) ∧ f(S2) = {a=7}

Meeting the inputs first loses the facts about b and c, and with them the fact that a = 7, so f(S1 ∧ S2) ≠ f(S1) ∧ f(S2).
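The example is small enough to run. Here is a tiny Python demonstration; modeling an environment as a dict of known constants is an assumption made for illustration.

    # Demo of the non-distributivity example: meet keeps only the facts
    # both sides agree on, and f models the block "a <- b + c".
    def meet(s1, s2):
        return {k: v for k, v in s1.items() if s2.get(k) == v}

    def f(env):
        out = dict(env)
        if "b" in env and "c" in env:
            out["a"] = env["b"] + env["c"]
        return out

    s1, s2 = {"b": 3, "c": 4}, {"b": 1, "c": 6}
    assert meet(f(s1), f(s2)) == {"a": 7}   # apply f first: a = 7 survives
    assert f(meet(s1, s2)) == {}            # meet first: b and c are lost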

Some Admissible Problems Are Not Rapid

Example: interprocedural May Modify sets.
- Iterations are proportional to the number of parameters, not to any property of the call graph. (This has nothing to do with d(G).)
- The example can be made arbitrarily bad: the iteration count is proportional to the length of the chain of bindings.

    shift(a, b, c, d, e, f) {
        local t;
        …
        call shift(t, a, b, c, d, e);
        f = 1;
        …
    }

Assume call by reference. Compute the set of variables (in shift) that can be modified by a call to shift. How long does it take?

[Figure: the chain of bindings among the parameters a, b, c, d, e, f created by the recursive call to shift.]