Dynamic Compilation Code Layout

- 1 - Overview
* Background and Motivation
* Approaches
  - Profile-guided code positioning
* Results
* More Examples
  - Online feedback directed optimization of Java
  - StarJIT
* Conclusion
* Q&A

- 2 - Background – Feedback Driven Code Generation
* Feedback driven code generation
  - Feedback driven techniques improve the quality of the code generated by the compiler
* Code layout, or code positioning, rearranges code in order to improve instruction locality and branch prediction
* FDO code generation can be done with online profiling or offline profile data

- 3 - Motivation
* TLB and i-cache misses are a performance bottleneck on any architecture
* Why is code layout important?
  - Rearranges code to increase instruction locality
  - Fewer i-cache misses
  - Fewer TLB misses
  - Placing hot code close together enables short jumps
  - More accurate branch prediction
  - More efficient packing of instructions

- 4 - Motivation
* RISC-type architectures have roughly two-fold larger instruction memory requirements than CISC-type architectures
* PA-RISC showed a CPI of about 3 for the MPE/XL operating system; of those 3 cycles, 1 was due wholly to instruction cache misses

- 5 - OLTP workload characteristics
* Capturing 99% of the instructions requires 200 KB (total footprint = 260 KB)
* Due to non-ideal packing of instructions, the actual size is 500 KB
* Flat execution profile and large instruction footprint!
Source: Code layout optimization for transaction processing workloads

- 6 - Impact of code layout optimization on i-cache misses
Source: Code layout optimization for transaction processing workloads

- 7 - Approaches
* Pettis & Hansen, 1990
* Improve the instruction memory hierarchy
* Two levels of optimization
  - Linker – procedure code rearrangement
  - Compiler – basic block rearrangement
* Previous work focused on page granularity
* The focus here is
  - Procedure (subspace) granularity
  - Basic block granularity

- 8 - Pettis & Hansen – Prototype 1
* Uses dynamic call graph information for procedure positioning
* "Closest-is-best" approach
  - This also reduces the number of long branches
* Profiling mechanism
  - The linker sees all direct calls between subspaces
  - The linker adds a stub to increment a counter
  - Counters are maintained in the application's data space
* Note: indirect calls through procedure pointers were not measured

- 9 - Procedure Ordering Algorithm
* Input: undirected weighted call graph built from collected profile information
* Nodes are procedures
* Edges correspond to calls between procedures
  - If two procedures are mutually recursive, or if one procedure calls another from several different places, the weights are merged in the call graph
* Bottom-up method, "closest-is-best"

Procedure Ordering Algorithm
* Pick an edge with maximum weight and merge the two nodes
* Coalesce the weights of the edges leaving the combined node
* Repeat until the graph contains no edges, only disjoint nodes
* Once an order is established, the linker places the subspaces in that sequence
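The merge loop above can be sketched in Python. All names here are illustrative, not from Pettis & Hansen's paper, and the sketch omits the paper's refinement of trying both chain orientations when two chains are joined:

```python
# Sketch of the greedy "closest-is-best" merge loop (simplified;
# names are illustrative, not from Pettis & Hansen's paper).

def procedure_order(procs, call_weights):
    """procs: list of procedure names.
    call_weights: {(caller, callee): call count} from the profile.
    Returns one suggested left-to-right layout order."""
    parent = {p: p for p in procs}         # union-find representative
    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p
    chain = {p: [p] for p in procs}
    # Merge parallel edges (mutual recursion, multiple call sites).
    weight = {}
    for (a, b), w in call_weights.items():
        key = frozenset((a, b))
        weight[key] = weight.get(key, 0) + w
    while weight:
        key = max(weight, key=weight.get)  # heaviest remaining edge
        weight.pop(key)
        a, b = sorted(key)                 # sorted() keeps the sketch deterministic
        a, b = find(a), find(b)
        if a == b:
            continue                       # already in the same chain
        chain[a] += chain[b]               # simplified: real algorithm also
        parent[b] = a                      # considers reversing chain ends
        # Coalesce edges so they now connect the combined node.
        coalesced = {}
        for k, v in weight.items():
            k2 = frozenset(find(x) for x in k)
            if len(k2) == 2:
                coalesced[k2] = coalesced.get(k2, 0) + v
        weight = coalesced
    return [p for r in procs if find(r) == r for p in chain[r]]
```

On a graph where A-C is the heaviest edge (weight 10) followed by B-D (weight 8), as in the example that follows, the loop first merges A with C, then B with D, and finally joins the two chains.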

Procedure Ordering Algorithm Example
* Heaviest edge is A-C, with weight 10
* Merge A and C
* Update the outgoing edges from the newly combined node AC

Procedure Ordering Algorithm Example
* Heaviest edge is B-D, with weight 8
* Merge B and D and update the outgoing edges
* Now edge (BD)–(AC) is the heaviest
* Decision: combine BD and AC

Procedure Ordering Algorithm Example
* There are 4 orders in which BD and AC can be merged
  - B-D-A-C or C-A-D-B
  - B-D-C-A or A-C-D-B
  - D-B-A-C or C-A-B-D
  - D-B-C-A or A-C-B-D
* From the original graph, A-B had a higher edge weight than B-C, and D has weight zero to both A and C
* Thus, the order chosen is D-B-A-C

Procedure Ordering Algorithm Example
* Ordering for E is trivial

In-class problem

Pettis & Hansen – Prototype 2
* Basic block analysis
  - Restructure branches so that backward branches are mostly taken while forward branches are mostly not taken
* Main idea: move infrequent code farther away so that the normal flow remains a straight-line sequence
* Edge weights are profiled, not just basic block counts

Pettis & Hansen – Prototype 2
* Benefits:
  - Longer sequences of code execute before taking a branch
  - Larger number of useful instructions per cache line
  - Fewer cache misses, denser instruction stream
  - Reduced branch misprediction
  - Better use of branch delay slots
  - Special case: an if-then-else with a seldom-executed else clause (think exception handlers); moving the infrequently executed code eliminates an unconditional branch

Basic Block Reordering Algorithm
* Algorithm 1: bottom-up approach
* Objective: create chains of basic blocks
* Input: graph of basic blocks with profiled edge weights
* Begin with each basic block as both the head and tail of its own chain
* Two chains are merged if an arc connects the tail of one to the head of the other

Basic Block Reordering Algorithm
* If the source of the arc is not a tail, or the target of the arc is not a head, the chains cannot be merged
* If the more frequently executed arc out of a conditional branch merges two chains (as it usually will, since hot code flows to hot code), then when the less frequently executed arc is considered, no merger is possible: the arc's source (the conditional branch) is already in the middle of a chain
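A minimal sketch of this bottom-up chain formation, assuming arcs have already been profiled (function and variable names are illustrative):

```python
# Sketch of bottom-up chain formation: process arcs from hottest to
# coldest, merging two chains only when the arc runs tail-to-head.
# Names are illustrative, not from the paper.

def build_chains(blocks, arcs):
    """blocks: list of basic block names.
    arcs: {(src, dst): profiled execution count}.
    Returns the final list of chains (each chain is a list of blocks)."""
    chain = {b: [b] for b in blocks}       # every block starts alone
    for (src, dst), _ in sorted(arcs.items(), key=lambda kv: -kv[1]):
        c1, c2 = chain[src], chain[dst]
        if c1 is c2:
            continue                       # same chain already: arc discarded
        if c1[-1] != src or c2[0] != dst:
            continue                       # src not a tail or dst not a head
        merged = c1 + c2                   # tail-to-head merge
        for b in merged:
            chain[b] = merged
    seen, chains = set(), []
    for b in blocks:
        if id(chain[b]) not in seen:
            seen.add(id(chain[b]))
            chains.append(chain[b])
    return chains
```

Run on the subgraph of the example that follows (B-C hottest, then C-D, N-B, D-F, E-N, with D-E coldest), this produces the single chain E-N-B-C-D-F, and the D-E arc is discarded.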

Basic Block Reordering Algorithm Example
* Step 1: B-C is the highest-weight edge
  - B-C
* Step 2: C-D added
  - B-C-D
* Step 3: N-B, D-F and E-N added
  - E-N-B-C-D-F
* Step 4: D-E edge discarded
  - Neither source nor target is a tail/head

Basic Block Reordering Algorithm Example
* Step 5: F-H added
  - E-N-B-C-D-F-H
* Step 6: H-N and F-I discarded
  - N is not a head and F is not a tail
* Step 7: I-J starts a new chain
  - E-N-B-C-D-F-H
  - I-J
* Continue in this manner until every edge has been considered and either added to a chain or discarded

Basic Block Reordering Algorithm Example
* Final set of chains
  - A
  - E-N-B-C-D-F-H
  - I-J-L
  - G-O
  - K
  - M
* The chains must be ordered such that not-taken conditional branches are forward branches

Basic Block Reordering Algorithm Example
* 6 conditional branches
  - At B: chain 2 (B) before chain 4 (O)
  - At C: chain 2 (C) before chain 4 (G)
  - At F: chain 2 (F) before chain 3 (I)
  - At I: chain 3 (I) before chain 6 (M)
  - At J: chain 3 (J) before chain 5 (K)
  - At D: link D-E should be forward, but since both are in the same chain, no reordering
* Final order of chains: A, E-N-B-C-D-F-H, I-J-L, G-O, K, M

Basic Block Reordering Algorithm
* Algorithm 2: top-down approach
* Objective: create chains of basic blocks
* Input: graph of basic blocks with profiled edge weights
* Simple depth-first approach
* Begin with a node and place its highest-weight successor immediately after it; continue until every successor has already been placed, then restart at an unplaced node, until all nodes have been visited once

Basic Block Reordering Algorithm Example
* Begin at A
* A-B-C-D-F-H-N
* Continue at E (the highest-weight edge seen but not yet selected)
* Continue at I, and so on
* Final ordering
  - A-B-C-D-F-H-N-E-I-J-L-O-K-G-M
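The top-down pass can be sketched as a greedy walk. This is an illustrative sketch: restarts pick the hottest block that was seen but not yet placed (as the example does with E), and blocks unreachable from the entry are ignored:

```python
import heapq

# Sketch of top-down placement: follow the hottest outgoing arc, and
# on a dead end restart at the hottest block seen but not yet placed.
# Names are illustrative, not from the paper.

def top_down_layout(entry, succs):
    """succs: {block: [(successor, profiled edge weight), ...]}."""
    order, placed = [], set()
    pending = []                           # max-heap via negated weights
    node = entry
    while True:
        order.append(node)
        placed.add(node)
        best, best_w = None, -1
        for s, w in succs.get(node, []):
            if s in placed:
                continue
            heapq.heappush(pending, (-w, s))
            if w > best_w:
                best, best_w = s, w
        if best is not None:
            node = best                    # follow the hottest arc
            continue
        node = None                        # dead end: restart
        while pending:
            _, b = heapq.heappop(pending)
            if b not in placed:
                node = b
                break
        if node is None:
            return order
```

On a small graph where A branches to B (hot) and E (cold), and B falls through to C, the walk places A-B-C and then restarts at E.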

Top-down vs. Bottom-up Approach
* Order of basic blocks
  - Bottom up: A-E-N-B-C-D-F-H-I-J-L-G-O-K-M
  - Top down: A-B-C-D-F-H-N-E-I-J-L-O-K-G-M
* Both approaches convert conditional branches into forward branches (usually not taken)
* Advantage of bottom-up over top-down
  - Better at removing unconditional branches

Procedure Splitting
* Frequently executed basic blocks are placed toward the top of the procedure (the primary portion)
* Infrequent code is placed at the bottom: the "fluff"
* Procedure splitting separates the fluff from the primary portion
  - Stubs with long branches are added to reach the relocated fluff region
  - A new procedure is created from the fluff blocks
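In sketch form, the split is just a partition of the procedure's blocks by profiled execution count. The names are illustrative, and the long-branch stub bookkeeping is omitted; blocks never executed in the profile run become fluff:

```python
# Sketch of procedure splitting: blocks never executed in the profile
# run become "fluff" and move to a separate cold region, reached via
# long-branch stubs (stub creation omitted; names illustrative).

def split_procedure(blocks, exec_counts):
    """blocks: block names in layout order.
    exec_counts: {block: profiled execution count}.
    Returns (primary portion, fluff)."""
    primary = [b for b in blocks if exec_counts.get(b, 0) > 0]
    fluff = [b for b in blocks if exec_counts.get(b, 0) == 0]
    return primary, fluff
```

For example, a procedure whose error-handling blocks were never reached during profiling keeps only its hot path in the primary portion.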

Procedure Splitting
* Reduces the number of pages required for the hot code from 2 to 1

Experimental findings
PP – Procedure Positioning; BBP – Basic Block Positioning; PS – Procedure Splitting
Performance breakdown for a small application – Othello

Procedure Positioning Experimental Data
* The number of static long branches increased
* However, the number of executed long branches is drastically lowered

Procedure Splitting Experimental Data
* The overhead of creating stubs is very low
* Measured as the ratio of fluff instructions moved to long branches inserted

Inferences
* Side effects of basic block positioning
  - Reduction in the number of executed instructions
  - Straightening of if-then sequences nullified the delay slots (counted as wasted instructions)
  - Removing infrequent basic blocks removes the unconditional jump needed to get around the infrequent block
* BBP alone achieved a 2-3 percent performance gain simply by reducing the number of executed instructions
* The average number of instructions executed before taking a branch increased from 6.19 to 8.09, a 31% improvement
* The number of executed penalty branches was reduced by 42%

Drawbacks
* Two-pass compilation
* Debugging positioned code is harder
* No reuse of data
  - Profiling information must be recollected
* Representative inputs
  - The input used during the profiling phase must be representative
* Fluff block size is not considered

Procedure vs. Basic Block Level

Procedure Positioning (PP)                      | Basic Block Positioning (BBP)
------------------------------------------------|------------------------------------------------
Linker level                                    | Compiler level
Larger granularity                              | Finer granularity
Hot and cold code sections are interspersed     | Hot code is close together at the top of the procedure
Lower benefits compared to BBP                  | Allows procedure splitting
Call graphs require less space (fewer nodes     | Massive control flow graphs, large number of
than the graphs for BBP)                        | basic blocks

Example 1: Online Feedback Directed Optimization of Java
* Online vs. offline
* Online strategies
* Code reordering results

Online profiling vs. Offline profiling

Online profiling                                   | Offline profiling
---------------------------------------------------|---------------------------------------------------
Difficult to implement                             | Easier to implement
High runtime cost                                  | No runtime cost (profiling is done on a prior run)
Optimizes within methods                           | Optimizes the complete program
Optimizations happen right away                    | Optimizations happen on the next run
Profiling information may not always be available  | Profiling information available prior to execution
Captures phase behavior of the application         | Captures average behavior of the application
Lower accuracy                                     | Better accuracy

Online profile strategies
1. Profile early, during unoptimized execution
  - Profile during the interpreted stage
  - Once hot code is detected, optimize it in the code cache
  - No profiling or instrumentation when executing optimized code
* Advantages:
  - Low profiling overhead (degradation is hidden by interpreter latency)
  - Profiling information is available early, enabling earlier FDO
* Disadvantages:
  - The profile may not accurately describe the optimized code
  - Optimization might be ineffective or counter-productive

Online profile strategies
2. Profile optimized code
  - Profile after interpretation, during a lightweight optimization phase
  - Code patching: remove instrumentation from optimized code after 3-4 runs of the optimized code
  - Profile in short bursts to avoid overhead
* Caveats:
  - Recording complete information in the system is not possible
  - Information may not be representative of program behavior
  - May introduce architecture-specific complexities, such as maintaining cache coherency
  - Identifying the right phase is difficult

Online Feedback Directed Optimization of Java
* Code reordering
  - Blocks are marked cold based on a static heuristic (e.g. exception handlers)
* Cold blocks are moved to the bottom of the procedure
* Uses the top-down approach of Pettis & Hansen

Improvement due to code reordering Source : Online feedback directed optimization of Java

Example 2: The StarJIT Compiler – A Dynamic Compiler for Managed Runtime Environments
* Overview
* Tail duplication
* Results

The StarJIT Compiler: A Dynamic Compiler for Managed Runtime Environments
* Specifically targets Intel architectures (Itanium Processor Family)
* Single compilation
  - Similar to a static compiler
* Transforms Java bytecode or CLI bytecode into native code
* Aggressive optimization using static heuristics
* Online profile-guided optimization
* Global optimizer – profile intensity changes over time

Source : The StarJIT Compiler: A Dynamic Compiler for Managed Runtime Environments, Intel

Optimizations performed
* Trace picker – picks a longer trace (the main trace)
* Tail duplication – eliminates cold side entries
* Method splitting (procedure splitting)
  - Partitions compiled code into hot and cold sections

Tail duplication
* Tail duplication creates superblocks by first identifying traces and then eliminating side entries into a trace
* For each side entrance, create a separate off-trace copy of the basic blocks between the side entrance and the trace exit, and redirect the side-entry edge to the copy

Tail duplication example

A
if (condition) {
  B
} else {
  C
}
D
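On a CFG represented as a successor map, the transformation can be sketched as follows. This is illustrative only: block bodies are not modeled, duplicated blocks just get a prime suffix, and a single side entry per trace point is assumed:

```python
# Sketch of tail duplication on an edge-list CFG. For each side entry
# into the middle of the trace, the tail of the trace is copied (primed
# names) and the side edge is redirected to the copy, leaving the trace
# a single-entry superblock. Names and representation are illustrative.

def tail_duplicate(cfg, trace):
    """cfg: {block: [successor blocks]}; trace: blocks of the hot trace."""
    on_trace = set(trace)
    new_cfg = {b: list(s) for b, s in cfg.items()}
    for i in range(1, len(trace)):
        blk = trace[i]
        # off-trace predecessors of blk are side entries
        side_preds = [p for p in cfg if blk in cfg[p] and p not in on_trace]
        if not side_preds:
            continue
        tail = trace[i:]
        for j, t in enumerate(tail):       # copy the tail, chaining the copies
            nxt = tail[j + 1] if j + 1 < len(tail) else None
            new_cfg[t + "'"] = [s + "'" if s == nxt else s for s in cfg[t]]
        for p in side_preds:               # redirect the side entry
            new_cfg[p] = [s + "'" if s == blk else s for s in new_cfg[p]]
    return new_cfg

# The if/else example above: hot trace A-B-D, side entry C -> D.
cfg = {'A': ['B', 'C'], 'B': ['D'], 'C': ['D'], 'D': []}
result = tail_duplicate(cfg, ['A', 'B', 'D'])
```

After the transform, C branches to the duplicate D' and the trace A-B-D has no remaining side entries.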

Tradeoffs of Tail Duplication
* Benefits:
  - Only the branch target changes; there is no additional modification of the copied or the original code, so overhead is small
  - Superblocks give high-performance code for the hot trace
* Disadvantage:
  - Code growth

Tradeoffs
* Trace scheduling and selection come after code layout, to schedule the unconditional branches
* But tail duplication must be done after the code layout is finalized
* Yet tail duplication decisions are best made with input from trace formation
* Trace Picking -> Code Layout -> Tail Duplication

Results
* Average 2.03% speedup
* Average 1.51% code growth
Source: Comparing tail duplication with compensation code in single path global instruction,

Conclusions
* Code layout optimizations improve instruction locality
* The main idea is to bring hot code closer together on the same page, reducing the number of long branches and i-cache misses
* Other benefits
  - Longer sequences of code execute before taking a branch
  - Larger number of useful instructions per cache line
  - Fewer cache misses, denser instruction stream
  - Reduced branch misprediction
  - Better use of branch delay slots
* Procedure splitting, procedure positioning, and basic block reordering are the main techniques used
* Both top-down and bottom-up approaches exist

Conclusions
* Feedback driven optimizations require profiling information, which can be
  - Online
  - Offline
* Tradeoffs of profiling techniques:
  - Increase in runtime
  - Two passes
  - Representativeness of the information gathered
  - Phase behavior of the application
  - Ease of implementation
  - Overhead of instrumentation and sampling