Profile-Guided Code Positioning See paper of the same name by Karl Pettis & Robert C. Hansen in PLDI 90, SIGPLAN Notices 25(6), pages 16–27 Copyright 2011,

Slides:



Advertisements
Similar presentations
School of EECS, Peking University “Advanced Compiler Techniques” (Fall 2011) SSA Guo, Yao.
Advertisements

P3 / 2004 Register Allocation. Kostis Sagonas 2 Spring 2004 Outline What is register allocation Webs Interference Graphs Graph coloring Spilling Live-Range.
Register Allocation CS 671 March 27, CS 671 – Spring Register Allocation - Motivation Consider adding two numbers together: Advantages: Fewer.
Zhiguo Ge, Weng-Fai Wong, and Hock-Beng Lim Proceedings of the Design, Automation, and Test in Europe Conference, 2007 (DATE’07) April /4/17.
SSA-Based Constant Propagation, SCP, SCCP, & the Issue of Combining Optimizations 1COMP 512, Rice University Copyright 2011, Keith D. Cooper & Linda Torczon,
The Last Lecture Copyright 2011, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in Comp 512 at Rice University have explicit permission.
Loop Invariant Code Motion — classical approaches — 1COMP 512, Rice University Copyright 2011, Keith D. Cooper & Linda Torczon, all rights reserved. Students.
Introduction to Code Optimization Comp 412 Copyright 2010, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at Rice.
Intermediate Representations Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at Rice University.
EECC551 - Shaaban #1 Spring 2006 lec# Pipelining and Instruction-Level Parallelism. Definition of basic instruction block Increasing Instruction-Level.
1 CS 201 Compiler Construction Lecture 13 Instruction Scheduling: Trace Scheduler.
Instruction Scheduling II: Beyond Basic Blocks Comp 412 Copyright 2010, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in Comp.
Pipelining and Exploiting Instruction-Level Parallelism (ILP)
Dynamic Optimization as typified by the Dynamo System See “Dynamo: A Transparent Dynamic Optimization System”, V. Bala, E. Duesterwald, and S. Banerjia,
O RERATıNG S YSTEM LESSON 10 MEMORY MANAGEMENT II 1.
Code Optimization, Part III Global Methods Comp 412 Copyright 2010, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in Comp 412.
CSC 4250 Computer Architectures December 5, 2006 Chapter 5. Memory Hierarchy.
The Procedure Abstraction, Part VI: Inheritance in OOLs Comp 412 Copyright 2010, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled.
The Procedure Abstraction, Part V: Support for OOLs Comp 412 Copyright 2010, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in.
Building SSA Form, III 1COMP 512, Rice University This lecture presents the problems inherent in out- of-SSA translation and some ways to solve them. Copyright.
Global Redundancy Elimination: Computing Available Expressions Copyright 2011, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled.
Cleaning up the CFG Eliminating useless nodes & edges C OMP 512 Rice University Houston, Texas Fall 2003 Copyright 2003, Keith D. Cooper & Linda Torczon,
Algebraic Reassociation of Expressions Briggs & Cooper, “Effective Partial Redundancy Elimination,” Proceedings of the ACM SIGPLAN 1994 Conference on Programming.
Local Instruction Scheduling — A Primer for Lab 3 — Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved. Students enrolled.
Practical Path Profiling for Dynamic Optimizers Michael Bond, UT Austin Kathryn McKinley, UT Austin.
CSCI1600: Embedded and Real Time Software Lecture 33: Worst Case Execution Time Steven Reiss, Fall 2015.
Terminology, Principles, and Concerns, IV With examples from LIVE and global block positioning Copyright 2011, Keith D. Cooper & Linda Torczon, all rights.
Dead Code Elimination This lecture presents the algorithm Dead from EaC2e, Chapter 10. That algorithm derives, in turn, from Rob Shillner’s unpublished.
Boolean & Relational Values Control-flow Constructs Comp 412 Copyright 2010, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in.
Register Allocation CS 471 November 12, CS 471 – Fall 2007 Register Allocation - Motivation Consider adding two numbers together: Advantages: Fewer.
Cleaning up the CFG Eliminating useless nodes & edges This lecture describes the algorithm Clean, presented in Chapter 10 of EaC2e. The algorithm is due.
High Performance Embedded Computing © 2007 Elsevier Lecture 10: Code Generation Embedded Computing Systems Michael Schulte Based on slides and textbook.
Instruction Scheduling Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved.
Profile Guided Code Positioning C OMP 512 Rice University Houston, Texas Fall 2003 Copyright 2003, Keith D. Cooper & Linda Torczon, all rights reserved.
Memory Optimizations & Post-Compilation Techniques CS 671 April 3, 2008.
Instruction Scheduling: Beyond Basic Blocks Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved. Students enrolled in Comp.
Dynamic Compilation Code Layout Overview  Background and Motivation v Approaches »Profile guided code positioning v Results v More Examples »Online.
Definition-Use Chains
Introduction to Optimization
5.2 Eleven Advanced Optimizations of Cache Performance
Improving Program Efficiency by Packing Instructions Into Registers
Main Memory Management
Improving cache performance of MPEG video codec
Local Instruction Scheduling
CSCI1600: Embedded and Real Time Software
Feedback directed optimization in Compaq’s compilation tools for Alpha
Introduction to Optimization
Intermediate Representations
Instruction Scheduling: Beyond Basic Blocks
CS 201 Compiler Construction
Building SSA Form COMP 512 Rice University Houston, Texas Fall 2003
Local Instruction Scheduling — A Primer for Lab 3 —
Ann Gordon-Ross and Frank Vahid*
Code Shape III Booleans, Relationals, & Control flow
Intermediate Representations
The Last Lecture COMP 512 Rice University Houston, Texas Fall 2003
Copyright 2003, Keith D. Cooper & Linda Torczon, all rights reserved.
Pipelining and Exploiting Instruction-Level Parallelism (ILP)
Pipelining and Exploiting Instruction-Level Parallelism (ILP)
Introduction to Optimization
Instruction Scheduling: Beyond Basic Blocks
Copyright 2003, Keith D. Cooper & Linda Torczon, all rights reserved.
Pipelining and Exploiting Instruction-Level Parallelism (ILP)
Pipelining and Exploiting Instruction-Level Parallelism (ILP)
Lecture 17: Register Allocation via Graph Colouring
Pipelining and Exploiting Instruction-Level Parallelism (ILP)
Pipelining and Exploiting Instruction-Level Parallelism (ILP)
Pipelining and Exploiting Instruction-Level Parallelism (ILP)
COMP755 Advanced Operating Systems
CSCI1600: Embedded and Real Time Software
Presentation transcript:

Profile-Guided Code Positioning See paper of the same name by Karl Pettis & Robert C. Hansen in PLDI 90, SIGPLAN Notices 25(6), pages 16–27 Copyright 2011, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in Comp 512 at Rice University have explicit permission to make copies of these materials for their personal use. Faculty from other educational institutions may use these materials for nonprofit educational purposes, provided this copyright notice is preserved. Comp 512 Spring COMP 512, Rice University

2 The Problem The placement of program text in memory matters Large working sets cause excessive TLB & page misses Bad placement can increase instruction cache misses  Pettis & Hansen report that on some benchmarks, 1 of 3 cycles was a cache miss (for PA-RISC ) Random placement leaves these effects to chance The plan Discover execution-time paths Rearrange the code to keep those paths in contiguous memory Make heavy use of execution profiles

COMP 512, Rice University3 Does this work? Examples within HP Pascal compiler  Moved frequently executed blocks to top of procedure  40% reduction in instruction cache misses  5% improvement in compiler’s running time Fortran compiler  Rearranged object files before linking  Attempt to improve locality on calls  20% system throughput improvement So, they believed...

COMP 512, Rice University4 Two Major Issues Procedure placement If A calls B, would like A & B in adjacent locations  On same page means smaller working set  Adjacent locations limit I-cache conflicts Unfortunately, many procedures might call B (& A)  A common & critical problem in interprocedural optimization This is an issue for the linker Block placement Same effects occur on a smaller scale Fall through branches create an additional incentive Rarely executed code fills up the cache, too! This is an issue for the compiler & optimizer Block placement precedes procedure placement

COMP 512, Rice University5 Procedure Placement Simple principles Build the call graph Annotate edges with execution frequencies Use “closest is best” placement  A calls B most often  place A next to B  Keeps branches short ( advantage on PA-RISC )  Direct mapped I-cache  A & B unlikely to overlap in I-cache Profiling the call graph Linker inserts a stub for each call that bumps a counter Counters are kept in statically initialized storage ( set to zero ) Adds overhead to execution, but only in training runs As much as 80 to 98% reduction in executed long branches Ignored indirect calls (through a pointer)

COMP 512, Rice University6 Procedure Placement Computing an order Combine all edges from A to B ( mutual recursion ) Select highest weight edge, say X  Y  Combine X & Y, along with their common edges, X  Z & Y  Z  Place X next to Y Repeat until graph cannot be reduced further X Y Z XY Z 6 May have disconnected subgraphs Must add new procedures at end > W  X and Y  Z with WZ & XY > Use weights in original graph > Largest weight closest The algorithm is straight forward

COMP 512, Rice University7 Block Placement Targets branches with unequal execution frequencies Make likely case the “fall through” case Move unlikely case out-of-line & out-of-sight Potential benefits Longer branch-free code sequences More executed operations per cache line Denser instruction stream  fewer cache misses Moving unlikely code  denser page use & fewer page faults

COMP 512, Rice University8 Block Placement Moving infrequently executed code B1B1 B4B4 B2B2 B3B3 B1B1 B4B4 B2B2 B3B Unlikely path gets fall through (cheap) case Likely path gets an extra branch Would like this to become B1B1 B4B4 B3B3 B2B2 Long distance In another page,... This branch goes away Denser instruction stream *

COMP 512, Rice University9 Block Placement Principles Goal is to eliminate taken branches  Build up traces – single paths Work from profile data  Edge profiles are better than block profiles Use a greedy, bottom-up strategy to combine blocks Gathering profile data Insert code to count edges Split critical edges Myriad details for gathering accurate whole-program edge counts

COMP 512, Rice University10 Block Placement The Idea Form chains that should be placed to form straight-line code First step: Build the chains 1.Make each block a degenerate chain & set its priority to # blocks 2.P  1 3.  edge e = in the CFG, in order by decreasing frequency if x is the tail of a chain, a, and y is the head of a chain, b then merge a and b else set priority(y) to min(priority(y),P ++) Point is to place targets after their sources, to make forward branches PA-RISC predicted most forward branches as taken, backward as not taken

COMP 512, Rice University11 Block Placement Second step: Lay out the code WorkList  chain containing the entry node, n 0 While (WorkList ≠ Ø) Pick the chain c with lowest priority(c) from WorkList Place it next in the code  edge leaving c add z to WorkList Intuitions Entry node first Tries to make edge from chain i to chain j a forward branch  Predicted as taken on target machine  Edge remains only if it is lower probability choice

COMP 512, Rice University12 Going Further – Procedure Splitting Any code that has profile count of zero (0) is “fluff” Move fluff into the distance  It rarely executes  Get more useful operations into I cache  Increase effective density of I cache Slower execution for rarely executed code Implementation Create a linkage-less procedure with an invented name Give it a priority that the linker will sort to the code’s end Replace original branch with a 0-profile branch to a 0-profile call  Cause linkage code to move to end of procedure to maintain density Branch to fluff becomes short branch to long branch. Block with long branch gets sorted to end of current procedure. *

COMP 512, Rice University13 Putting It Together Procedure placement is done in the linker Block placement is done in the optimizer  Allows branch elision due to fluff, other tailoring Speedups ranged from 2 to 10%, depending on cache size  On the PA-RISC; better results on x86 systems This technique paid off handsomely on early 1990s PCs  Long cache lines  Slow page faults  Microsoft insiders suggested it was most important optimization for codes like Office ( Word, PowerPoint, Excel )