Peephole Optimization & Other Post-Compilation Techniques

Presentation transcript:

Peephole Optimization & Other Post-Compilation Techniques
COMP 512, Rice University, Spring 2011
Copyright 2011, Keith D. Cooper, Linda Torczon, & Jason Eckhardt, all rights reserved. Students enrolled in Comp 512 at Rice University have explicit permission to make copies of these materials for their personal use. Faculty from other educational institutions may use these materials for nonprofit educational purposes, provided this copyright notice is preserved.

The Problem
After compilation, the code still has some flaws
- Scheduling & allocation really are NP-Complete
- Optimizer may not implement every needed transformation
Curing the problem
- More work on scheduling and allocation
- Implement more optimizations
- Or, optimize after compilation
  - Peephole optimization
  - Link-time optimization

Post-compilation Optimization
Field has attracted much attention lately
- New & interesting opportunities exist at link time & at run time
  - Interprocedural analysis & optimization
  - Resolved references & data-structure sizes
  - Code that is out of scope at other times, such as linkage code
  - Optimize object-only distributions of libraries & applications
- Hard problems in this arena
  - Reconstructing control flow

Unravelling Control-flow (TI TMS320C6x)
            B.S1      LOOP            ; branch to loop
         || ZERO.L1   A2              ; zero A side product
         || ZERO.L2   B2              ; zero B side product
            B.S1      LOOP            ; branch to loop
         || ZERO.L1   A3              ; zero A side accumulator
         || ZERO.L2   B3              ; zero B side accumulator
         || ZERO.D1   A1              ; zero A side load value
         || ZERO.D2   B1              ; zero B side load value
    LOOP:   LDW.D1    *A4++, A1       ; load a[i] & a[i+1]
         || LDW.D2    *B4++, B1       ; load b[i] & b[i+1]
         || MPY.M1X   A1, B1, A2      ; a[i] * b[i]
         || MPYH.M2X  A1, B1, B2      ; a[i+1] * b[i+1]
         || ADD.L1    A2, A3, A3      ; ca += a[i] * b[i]
         || ADD.L2    B2, B3, B3      ; cb += a[i+1] * b[i+1]
         || [B0] SUB.S2  B0, 1, B0    ; decrement loop counter
         || [B0] B.S1    LOOP         ; branch to loop
            ADD.L1X   A3, B3, A3      ; c = ca + cb
Callouts on the slide:
- Set up the loop + 1 more branch
- Single cycle loop ending with another branch
- Stuff 4 branches into the pipe

Unravelling Control-flow (TI TMS320C6x), continued
From the TMS320C6x code above, the post-optimization translator is supposed to recover a simple source code loop …

Post-compilation Optimization
Field has attracted much attention lately
- New & interesting opportunities exist at link time & at run time
  - Interprocedural analysis & optimization
  - Resolved references & data-structure sizes
  - Code that is out of scope at other times, such as linkage code
  - Optimize object-only distributions of libraries & applications
- Hard problems in this arena
  - Reconstructing control flow
  - Reconstructing type & dimension information
  - Analyzing pointers & addresses
While whole-program techniques are expensive, at link time the compiler must traverse all that code anyway.

Potential Post-compilation Optimizations
A mix of old and new techniques
- Peephole optimization
- Code positioning (& branch optimizations)
- Sinking (cross jumping) & hoisting
- Procedure abstraction (for space)
- Register reallocation (scavenging, Belady, coloring)
- Bit-transition reduction
- Dead & Clean
- Constant propagation
- Pointer disambiguation & register promotion

Peephole Optimization
The Basic Idea
- Discover local improvements by looking at a window on the code
  - A tiny window is good enough: a peephole
- Slide the peephole over the code and examine the contents
  - Pattern match with a limited set of patterns
Examples
    storeAI r1 ⇒ r0,8
    loadAI  r0,8 ⇒ r15
  becomes
    storeAI r1 ⇒ r0,8
    i2i     r1 ⇒ r15

    addI r2,0 ⇒ r7
    mult r4,r7 ⇒ r10
  becomes
    mult r4,r2 ⇒ r10

    jumpI → l10
    l10: jumpI → l11
  becomes
    jumpI → l11
    l10: jumpI → l11
Less likely as a local sequence, but other opts can create it …
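The three rewrites above can be sketched as a small pattern matcher. This is only an illustration under stated assumptions, not the code of any particular optimizer: the tuple encoding of operations and the names `match` and `peephole` are invented for the example, and the `addI` rewrite assumes the intermediate register (r7 above) is not used later.

```python
# Hypothetical tuple encoding of ILOC-like operations (an assumption):
#   ("storeAI", src, base, off)   ("loadAI", base, off, dst)
#   ("i2i", src, dst)             ("addI", src, const, dst)
#   ("mult", a, b, dst)           ("jumpI", label)   ("label", name)

def match(window):
    """Return (ops_consumed, replacement) for the start of window, or None."""
    if len(window) < 2:
        return None
    a, b = window[0], window[1]
    c = window[2] if len(window) > 2 else None
    # store then load of the same location: keep the store, copy the register
    if a[0] == "storeAI" and b[0] == "loadAI" and (a[2], a[3]) == (b[1], b[2]):
        return 2, [a, ("i2i", a[1], b[3])]
    # add of zero feeding a multiply: use the original register directly
    # (assumes the addI result is dead after the multiply)
    if a[0] == "addI" and a[2] == 0 and b[0] == "mult" and a[3] in (b[1], b[2]):
        srcs = tuple(a[1] if r == a[3] else r for r in (b[1], b[2]))
        return 2, [("mult", srcs[0], srcs[1], b[3])]
    # jump to a label that holds another jump: retarget the first jump
    if (c is not None and a[0] == "jumpI" and b == ("label", a[1])
            and c[0] == "jumpI" and c[1] != a[1]):
        return 1, [("jumpI", c[1])]
    return None

def peephole(ops):
    """Slide a three-operation window over ops, rewriting where a pattern hits."""
    ops = list(ops)
    i = 0
    while i < len(ops):
        hit = match(ops[i:i + 3])
        if hit is None:
            i += 1                      # nothing to do here; advance the window
        else:
            used, replacement = hit
            ops[i:i + used] = replacement   # stay at i and re-examine
    return ops

code = [
    ("storeAI", "r1", "r0", 8), ("loadAI", "r0", 8, "r15"),
    ("addI", "r2", 0, "r7"), ("mult", "r4", "r7", "r10"),
    ("jumpI", "l10"), ("label", "l10"), ("jumpI", "l11"),
]
for op in peephole(code):
    print(op)
```

A hand-coded matcher like this is quick because the window and the pattern set are both small; every new pattern is another explicit test in `match`.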

Peephole Optimization
Early Peephole Optimizers (McKeeman)
- Used limited set of hand-coded patterns
- Matched with exhaustive search
- Small window, small pattern set ⇒ quick execution
  - Window of 2 to 5 ops
They proved effective at cleaning up the rough edges
  - Code generation is inherently local
  - Boundaries between local regions are trouble spots
Improvements in code generation, optimization, & architecture should have let these fade into obscurity
  - Much better allocation & scheduling today than in 1965
  - But, we have much more complex architectures

Peephole Optimization
Modern Peephole Optimizers (Davidson, Fraser)
- Larger, more complex ISAs ⇒ larger pattern sets
- This has produced a more systematic approach
    ASM → Expander (ASM → LLIR) → LLIR → Simplifier (LLIR → LLIR) → LLIR → Matcher (LLIR → ASM) → ASM
Expander
- Operation-by-operation expansion into LLIR
- Needs no context
- Captures full effect of an operation
    add r1,r2 ⇒ r4
  expands to
    r4 ← r1 + r2
    cc ← f(r1 + r2)

Peephole Optimization
Modern Peephole Optimizers (Davidson, Fraser)
Simplifier
- Single pass over LLIR, moving the peephole
- Forward substitution, algebraic simplification, constant folding, & eliminating useless effects (must know what is dead)
- Eliminate as many LLIR operations as possible

Peephole Optimization
Modern Peephole Optimizers (Davidson, Fraser)
Matcher
- Starts with reduced LLIR program
- Compares LLIR from peephole against pattern library
- Selects one or more ASM patterns that “cover” the LLIR
  - Capture all of its effects
  - May create new useless effects (setting the cc)

Finding Dead Effects
The simplifier must know what is useless (i.e., dead)
- Expander works in a context-independent fashion
- It can process the operations in any order
  - Use a backward walk and compute local LIVE information
  - Tag each operation with a list of useless values
- What about non-local effects?
  - Most useless effects are local: DEF & USE in same block
  - It can be conservative & assume LIVE until proven dead
Example: expand, simplify, match
  ASM
    mult r5,r9 ⇒ r12
    add  r12,r17 ⇒ r13
  expands to LLIR
    r12 ← r5 * r9
    cc  ← f(r5 * r9)        (this effect would prevent multiply-add from matching)
    r13 ← r12 + r17
    cc  ← f(r12 + r17)
  simplifies to LLIR
    r12 ← r5 * r9
    cc  ← f(r5 * r9)
    r13 ← r12 + r17
    cc  ← f(r12 + r17)
  matches ASM
    madd r5,r9,r17 ⇒ r13
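The backward walk described above can be sketched in a few lines. This is a minimal illustration, assuming each LLIR effect is reduced to a hypothetical `(target, uses)` pair and that the caller supplies a conservative `live_out` set; the function name `find_dead_effects` is invented for the example.

```python
def find_dead_effects(block, live_out):
    """Backward walk over one block of LLIR effects.

    Each effect is (target, uses): the name it defines and the names it reads.
    live_out holds every name assumed LIVE at the end of the block; being
    conservative means leaving a name in live_out until it is proven dead.
    Returns the indices of effects whose value is never read.
    """
    live = set(live_out)
    dead = []
    for i in range(len(block) - 1, -1, -1):
        target, uses = block[i]
        if target in live:
            live.discard(target)    # this definition satisfies the later uses
            live.update(uses)       # its operands are live above this point
        else:
            dead.append(i)          # nothing below reads it: a useless effect
    return sorted(dead)

# The expanded mult/add pair from the example, one effect per entry.
llir = [
    ("r12", {"r5", "r9"}),      # r12 <- r5 * r9
    ("cc",  {"r5", "r9"}),      # cc  <- f(r5 * r9)
    ("r13", {"r12", "r17"}),    # r13 <- r12 + r17
    ("cc",  {"r12", "r17"}),    # cc  <- f(r12 + r17)
]
print(find_dead_effects(llir, live_out={"r13", "cc"}))
```

With `cc` conservatively assumed live at the end of the block, only the first condition-code effect (index 1) is reported as dead, because the second one overwrites it before any use. That is the effect the simplifier has to remove before a multiply-add pattern can match.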

Peephole Optimization
Can use it to perform instruction selection
- Key issue in selection is effective pattern matching
    ASM → Expander (ASM → LLIR) → LLIR → Simplifier (LLIR → LLIR) → LLIR → Matcher (LLIR → ASM) → ASM
Using peephole system for instruction selection
- Have front-end generate LLIR directly
- Eliminates need for the Expander
- Keep Simplifier and Matcher
  - Add a simple register assigner, follow with real allocation
This basic scheme is used in GCC

Peephole-Based Instruction Selection
    Source → Front End (Source → LLIR) → LLIR → Optimizer (LLIR → LLIR) → LLIR → Simplifier (LLIR → LLIR) → LLIR → Matcher (LLIR → ASM) → ASM → Allocator (ASM → ASM) → ASM
Basic Structure of Compilers like GCC
- Uses RTL as its IR (very low level, register-transfer language)
- Numerous optimization passes
- Quick translation into RTL limits what optimizer can do ...
- Matcher generated from specs (hard-coded tree-pattern matcher)

An Example
Original code
    w ← x - 2 * y
Translation to LLIR (from the compiler's IR, or through the Expander)
    r10 ← 2
    r11 ← @y
    r12 ← r0 + r11
    r13 ← M(r12)
    r14 ← r10 × r13
    r15 ← @x
    r16 ← r0 + r15
    r17 ← M(r16)
    r18 ← r17 - r14
    r19 ← @w
    r20 ← r0 + r19
    M(r20) ← r18

Simplification with a three operation window
Original code
    r10 ← 2
    r11 ← @y
    r12 ← r0 + r11
    r13 ← M(r12)
    r14 ← r10 × r13
    r15 ← @x
    r16 ← r0 + r15
    r17 ← M(r16)
    r18 ← r17 - r14
    r19 ← @w
    r20 ← r0 + r19
    M(r20) ← r18
Successive window contents
     1.  r10 ← 2              r11 ← @y              r12 ← r0 + r11
     2.  r10 ← 2              r12 ← r0 + @y         r13 ← M(r12)
     3.  r10 ← 2              r13 ← M(r0 + @y)      r14 ← r10 × r13
     4.  r13 ← M(r0 + @y)     r14 ← 2 × r13         r15 ← @x           (no further improvement is found)
     5.  r14 ← 2 × r13        r15 ← @x              r16 ← r0 + r15
     6.  r14 ← 2 × r13        r16 ← r0 + @x         r17 ← M(r16)
     7.  r14 ← 2 × r13        r17 ← M(r0 + @x)      r18 ← r17 - r14
     8.  r17 ← M(r0 + @x)     r18 ← r17 - r14       r19 ← @w
     9.  r18 ← r17 - r14      r19 ← @w              r20 ← r0 + r19
    10.  r18 ← r17 - r14      r20 ← r0 + @w         M(r20) ← r18
    11.  r18 ← r17 - r14      M(r0 + @w) ← r18

Example, Continued
Original code
    r10 ← 2
    r11 ← @y
    r12 ← r0 + r11
    r13 ← M(r12)
    r14 ← r10 × r13
    r15 ← @x
    r16 ← r0 + r15
    r17 ← M(r16)
    r18 ← r17 - r14
    r19 ← @w
    r20 ← r0 + r19
    M(r20) ← r18
Simplify
    r13 ← M(r0 + @y)
    r14 ← 2 × r13
    r17 ← M(r0 + @x)
    r18 ← r17 - r14
    M(r0 + @w) ← r18
Simplification shrinks the code significantly
- Takes 5 operations instead of 12
- Uses 4 registers instead of 11
Match
    loadAI  r0,@y ⇒ r13
    multI   r13,2 ⇒ r14
    loadAI  r0,@x ⇒ r17
    sub     r17,r14 ⇒ r18
    storeAI r18 ⇒ r0,@w
and, we're done ...
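The reduction from 12 operations to 5 can be reproduced with a short forward-substitution pass. The sketch below is not the Davidson and Fraser simplifier: instead of modelling the sliding window, it uses per-block use counts to decide what to forward. The nested-tuple encoding and the names `walk`, `substitute`, and `simplify` are assumptions made for the example, as are the premises that every register is defined once, that r0 is never redefined, and that a register with a single use in the block is dead afterwards.

```python
from collections import Counter

LEAVES = ("const", "addr")   # expressions that fit in an immediate field

def walk(expr, uses, addr_regs):
    """Count register uses; record registers used directly as an address."""
    kind = expr[0]
    if kind == "reg":
        uses[expr[1]] += 1
    elif kind == "load":
        if expr[1][0] == "reg":
            addr_regs.add(expr[1][1])
        walk(expr[1], uses, addr_regs)
    elif kind in ("+", "-", "*"):
        walk(expr[1], uses, addr_regs)
        walk(expr[2], uses, addr_regs)

def substitute(expr, fwd):
    """Replace forwarded registers in expr by their defining expressions."""
    kind = expr[0]
    if kind == "reg":
        return fwd.get(expr[1], expr)
    if kind == "load":
        return ("load", substitute(expr[1], fwd))
    if kind in ("+", "-", "*"):
        return (kind, substitute(expr[1], fwd), substitute(expr[2], fwd))
    return expr

def simplify(ops):
    """ops: ("def", reg, expr) or ("store", address_expr, value_expr)."""
    uses, addr_regs = Counter(), set()
    for op in ops:
        if op[0] == "store":
            if op[1][0] == "reg":
                addr_regs.add(op[1][1])
            walk(op[1], uses, addr_regs)
            walk(op[2], uses, addr_regs)
        else:
            walk(op[2], uses, addr_regs)
    fwd, out = {}, []
    for op in ops:
        if op[0] == "store":
            out.append(("store", substitute(op[1], fwd), substitute(op[2], fwd)))
            continue
        _, reg, expr = op
        expr = substitute(expr, fwd)
        is_leaf = expr[0] in LEAVES
        is_base_plus_offset = (expr[0] == "+" and expr[1][0] == "reg"
                               and expr[2][0] in LEAVES)
        if uses[reg] == 1 and (is_leaf or (is_base_plus_offset and reg in addr_regs)):
            fwd[reg] = expr              # forwarded into its only use
        else:
            out.append(("def", reg, expr))
    return out

def R(name):
    return ("reg", name)

example = [
    ("def", "r10", ("const", 2)),
    ("def", "r11", ("addr", "y")),
    ("def", "r12", ("+", R("r0"), R("r11"))),
    ("def", "r13", ("load", R("r12"))),
    ("def", "r14", ("*", R("r10"), R("r13"))),
    ("def", "r15", ("addr", "x")),
    ("def", "r16", ("+", R("r0"), R("r15"))),
    ("def", "r17", ("load", R("r16"))),
    ("def", "r18", ("-", R("r17"), R("r14"))),
    ("def", "r19", ("addr", "w")),
    ("def", "r20", ("+", R("r0"), R("r19"))),
    ("store", R("r20"), R("r18")),
]
for op in simplify(example):
    print(op)
```

Only constants, addresses, and base-plus-offset sums used as memory addresses are forwarded, so every surviving operation still has a shape that a single ASM operation such as `loadAI`, `multI`, or `storeAI` can cover.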

Other Considerations
Control-flow operations
- Can clear simplifier's window at branch or label
- More aggressive approach: combine across branches
  - Must account for effects on all paths
  - Not clear that this pays off …
- Same considerations arise with predication
Physical versus logical windows
- Can run optimizer over a logical window
  - k operations connected by DEF-USE chains
- Expander can link DEFs & USEs
- Logical windows (within block) improve effectiveness
Davidson & Fraser report 30% faster & 20% fewer ops with local logical window.
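The difference between a physical and a logical window can be shown with a small sketch. It assumes each operation is reduced to a hypothetical `(dst, srcs)` pair and follows only the first use of each definition within the block; the name `logical_window` is invented for the example.

```python
def logical_window(block, start, k=3):
    """Collect up to k operation indices linked by DEF-USE chains.

    block is a list of (dst, srcs) pairs for one basic block. Starting at
    `start`, follow each definition to the first later operation that uses
    it, stopping if the register is redefined before any use.
    """
    window = [start]
    cur = start
    while len(window) < k:
        dst = block[cur][0]
        nxt = None
        for j in range(cur + 1, len(block)):
            if dst in block[j][1]:
                nxt = j                 # first use of the value defined at cur
                break
            if block[j][0] == dst:
                break                   # redefined before any use
        if nxt is None:
            break
        window.append(nxt)
        cur = nxt
    return window

block = [
    ("r10", ()),                # r10 <- 2
    ("r11", ()),                # r11 <- @y
    ("r12", ("r0", "r11")),     # r12 <- r0 + r11
    ("r13", ("r12",)),          # r13 <- M(r12)
    ("r14", ("r10", "r13")),    # r14 <- r10 x r13
]
print(logical_window(block, 0))   # window that starts at r10 <- 2
print(logical_window(block, 1))   # window that starts at r11 <- @y
```

A physical window starting at `r10 ← 2` holds operations 0, 1, and 2, none of which use r10. The logical window jumps straight to the multiply at index 4, so the constant can be folded without waiting for the intervening operations to leave the window.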

Peephole Optimization
So, …
- Peephole optimization remains viable
  - Post allocation improvements
  - Cleans up rough edges
- Peephole technology works for selection
  - Description driven matchers (hard coded or LR(1))
  - Used in several important systems
- Simplification pays off late in process
  - Low-level substitution, identities, folding, & dead effects
All of this will work equally well in binary-to-binary translation

Other Post-Compilation Techniques
What else makes sense to do after compilation?
- Profile-guided code positioning
  - Allocation intact, schedule intact
- Cross-jumping
  - Allocation intact, schedule changed
- Hoisting (harder)
  - Changes allocation & schedule, needs data-flow analysis
- Procedure abstraction
  - Changes allocation & schedule, really needs an allocator
- Register scavenging
  - Changes allocation & schedule, purely local transformation
- Bit-transition reduction
  - Schedule & allocation intact, assignment changed
Harder problems:
1. Rescheduling
2. Operation placement
3. Vectorization

Register Scavenging
Simple idea
- Global allocation does a good job on the big picture items
- Leaves behind blocks where some registers are unused
Let's scavenge those unused registers
- Compute LIVE information
- Walk each block to find underallocated region
  - Find spilled local subranges
  - Opportunistically promote them to registers
A note of realism:
- Opportunities exist, but this is a 1% to 2% improvement
T.J. Harvey, Reducing the Impact of Spill Code, MS Thesis, Rice University, May 1998

Bit-transition Reduction
Inter-operation bit-transitions relate to power consumption
- Large fraction of CMOS power is spent switching states
- Same op on same functional unit costs less power
  - All other things being equal
Simple idea
- Reassign registers to minimize interoperation bit transitions
- Build some sort of weighted graph
- Use a greedy algorithm to pick names by distance
- Should reduce power consumption in fetch & decode hardware
Waterman's MS thesis (in preparation)
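The quantity being minimized can be made concrete with a short sketch. The 5-bit field layout in `encode` is a made-up instruction format, not any real ISA, and the function names are assumptions for the example; the point is only that register names chosen close together in Hamming distance flip fewer bits between consecutive operations.

```python
def hamming(a, b):
    """Number of bit positions in which two non-negative integers differ."""
    return bin(a ^ b).count("1")

def encode(opcode, dst, src1, src2):
    """Pack one operation into a word (hypothetical format, 5 bits per field)."""
    return (opcode << 15) | (dst << 10) | (src1 << 5) | src2

def transitions(words):
    """Total bits that flip between consecutive instruction words."""
    return sum(hamming(x, y) for x, y in zip(words, words[1:]))

ADD = 0   # hypothetical opcode number

# Two adds in sequence; the second uses registers 4, 6, 5.
before = [encode(ADD, 3, 1, 2), encode(ADD, 4, 6, 5)]
# Same code after renaming the second op's registers to 7, 5, 6.
after = [encode(ADD, 3, 1, 2), encode(ADD, 7, 5, 6)]

print(transitions(before), transitions(after))
```

In this example the original names flip three bits in every register field, while the renamed version flips one bit per field. A greedy renamer would use distances like these as edge weights when picking names.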

Bit-transition Reduction
Other transformations
- Swap operands on commutative operators
  - More complex than it sounds
  - Shoot for zero-transition pairs
- Swap operations within “fetch packets”
  - Works for superscalar, not VLIW
- Consider bit transitions in scheduling
  - Same ops to same functional unit
  - Nearby (Hamming distance) ops next, and so on …
- Factor bit transitions into instruction selection
  - Maybe use a BURS model with dynamic costs
Again, most of this fits into a post-compilation framework …
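Operand swapping on commutative operators can be sketched as a greedy pass. This is a deliberately naive illustration: it looks only at the previous operation, which is one reason the real problem is more complex than it sounds. The `(opcode, src1, src2, dst)` encoding with integer register numbers, the opcode names, and the function names are assumptions for the example.

```python
COMMUTATIVE = {"add", "mult"}   # hypothetical opcode names

def hamming(a, b):
    """Number of bit positions in which two non-negative integers differ."""
    return bin(a ^ b).count("1")

def field_cost(prev, cur):
    """Bit flips in the two source-register fields between adjacent ops."""
    return hamming(prev[1], cur[1]) + hamming(prev[2], cur[2])

def swap_operands(ops):
    """For each commutative op, pick the operand order closest to its predecessor."""
    out = []
    for op in ops:
        if out and op[0] in COMMUTATIVE:
            swapped = (op[0], op[2], op[1], op[3])
            if field_cost(out[-1], swapped) < field_cost(out[-1], op):
                op = swapped
        out.append(op)
    return out

# Operations are (opcode, src1, src2, dst) with register numbers.
code = [("add", 1, 6, 3), ("add", 6, 1, 4)]
print(swap_operands(code))
```

Swapping the second add's operands turns the pair into a zero-transition pair in the source fields. Because each choice changes the predecessor seen by the next operation, a greedy left-to-right pass is not guaranteed to find the best overall ordering.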