
Slide 1: Dynamic Compilation Code Layout

Slide 2: Overview
- Background and Motivation
- Approaches
  - Profile-guided code positioning
- Results
- More Examples
  - Online feedback-directed optimization of Java
  - StarJIT
- Conclusion
- Q&A

Slide 3: Background – Feedback-Driven Code Generation
- Feedback-driven code generation uses profile feedback to improve the quality of the code the compiler generates
- Code layout (code positioning) rearranges code to improve instruction locality and branch prediction
- FDO code generation can use online profiling or offline profile data

Slide 4: Motivation
- TLB and i-cache misses are a performance bottleneck on any architecture
- Why is code layout important?
  - Rearranging code increases instruction locality
  - Fewer i-cache misses
  - Fewer TLB misses
  - Placing hot code close together yields short jumps
  - More accurate branch prediction
  - More efficient packing of instructions

Slide 5: Motivation
- RISC-type architectures require roughly twice the instruction memory of CISC-type architectures
- PA-RISC showed a CPI of about 3 for the MPE/XL operating system; one of those three cycles was due entirely to instruction cache misses

Slide 6: OLTP Workload Characteristics
- Capturing 99% of the instructions requires 200 KB (total footprint = 260 KB)
- Due to non-ideal packing of instructions, the actual size is 500 KB
- Flat execution profile and large instruction footprint!
(Source: Code layout optimization for transaction processing workloads)

Slide 7: Impact of Code Layout Optimization on i-Cache Misses
(Source: Code layout optimization for transaction processing workloads)

Slide 8: Approaches
- Pettis & Hansen, 1990
- Goal: improve the instruction memory hierarchy
- Two levels of optimization:
  - Linker – procedure-level code rearrangement
  - Compiler – basic block rearrangement
- Previous work focused on page granularity
- The focus here is:
  - Procedure (subspace) granularity
  - Basic block granularity

Slide 9: Pettis & Hansen – Prototype 1
- Uses dynamic call graph information for procedure positioning
- "Closest-is-best" approach
  - This also reduces the number of long branches
- Profiling mechanism:
  - The linker sees all direct calls between subspaces
  - The linker adds a stub that increments a counter
  - Counters are maintained in the application's data space
- Note: indirect calls through procedure pointers were not measured
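
To make the profiling mechanism concrete, here is a minimal Python sketch of the same idea: keep one counter per (caller, callee) pair and bump it on every direct call. The decorator-based instrumentation is purely hypothetical; the original system patches stubs in at link time and keeps the counters in the application's data space.

```python
from collections import defaultdict
import functools
import inspect

call_counts = defaultdict(int)   # (caller, callee) -> call count

def profiled(fn):
    """Stand-in for a linker stub: count each direct call edge."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        caller = inspect.stack()[1].function   # name of the calling frame
        call_counts[(caller, fn.__name__)] += 1
        return fn(*args, **kwargs)
    return wrapper

@profiled
def helper():
    pass

@profiled
def main():
    for _ in range(3):
        helper()

main()
print(dict(call_counts))
# {('<module>', 'main'): 1, ('main', 'helper'): 3}
```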

Slide 10: Procedure Ordering Algorithm
- Input: an undirected weighted call graph built from the collected profile information
- Nodes are procedures
- Edges correspond to calls between procedures
  - If two procedures are mutually recursive, or one procedure calls another from several different sites, the weights are merged in the call graph
- Bottom-up method, "closest-is-best"

Slide 11: Procedure Ordering Algorithm
- Pick the edge with maximum weight and merge its two nodes
- Coalesce the weights of the edges leaving the combined node
- Repeat until the graph contains no edges, only disjoint nodes
- Once an order is established, the linker places the subspaces in that sequence
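
The merging loop is easy to misread, so here is a short Python sketch of it; this is an illustrative reconstruction, not the original linker code. Weights between merged nodes are coalesced by summing the original member edges, and orientation follows "closest-is-best": of the four possible concatenations, the one whose adjoining endpoints share the heaviest original edge wins.

```python
def order_procedures(edges):
    """edges: {(a, b): weight} over an undirected weighted call graph.
    Returns one linear placement of the procedures."""
    w = {frozenset(e): wt for e, wt in edges.items()}
    orig = lambda a, b: w.get(frozenset((a, b)), 0)
    chains = [[n] for n in sorted({n for e in edges for n in e})]

    def coalesced(c1, c2):
        # Merged edge weight between two chains: sum of member edges.
        return sum(orig(a, b) for a in c1 for b in c2)

    while len(chains) > 1:
        # Pick the pair of chains joined by the heaviest coalesced edge.
        wt, i, j = max((coalesced(c1, c2), i, j)
                       for i, c1 in enumerate(chains)
                       for j, c2 in enumerate(chains) if i < j)
        if wt == 0:
            break          # no edges remain, only disjoint nodes
        c1, c2 = chains[i], chains[j]
        # "Closest-is-best": among the four concatenations, keep the one
        # whose adjoining endpoints had the heaviest original edge.
        merged = max((a + b for a in (c1, c1[::-1]) for b in (c2, c2[::-1])),
                     key=lambda c: orig(c[len(c1) - 1], c[len(c1)]))
        chains[i] = merged
        del chains[j]
    return [p for chain in chains for p in chain]
```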

Slide 12: Procedure Ordering Algorithm Example
- The heaviest edge is A-C, with weight 10
- Merge A and C
- Update the outgoing edges from the newly combined node AC

Slide 13: Procedure Ordering Algorithm Example
- The heaviest edge is now B-D, with weight 8
- Merge B and D and update the outgoing edges
- Now the edge (BD)-(AC) is the heaviest
- Decision: combine BD and AC

Slide 14: Procedure Ordering Algorithm Example
- There are four orders in which BD and AC can be merged:
  - B-D-A-C or C-A-D-B
  - B-D-C-A or A-C-D-B
  - D-B-A-C or C-A-B-D
  - D-B-C-A or A-C-B-D
- From the original graph, A-B has a higher edge weight than B-C, and D has weight zero to both A and C
- Thus, the order chosen is D-B-A-C

Slide 15: Procedure Ordering Algorithm Example
- The placement of E is trivial
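
Running the sketch from slide 11 on this example reproduces the slides' result. Only the A-C (10) and B-D (8) weights are given on the slides; the remaining weights below are hypothetical, chosen only to respect the stated constraints (A-B heavier than B-C; no D-A or D-C edges).

```python
example = {("A", "C"): 10, ("B", "D"): 8,
           ("A", "B"): 4, ("B", "C"): 3, ("C", "E"): 1}   # 4, 3, 1 assumed
print(order_procedures(example))
# ['D', 'B', 'A', 'C', 'E'] -- the slides' D-B-A-C, with E placed last
```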

Slide 16: In-class problem

Slide 17: Pettis & Hansen – Prototype 2
- Basic block analysis
  - Restructure branches so that backward branches are mostly taken and forward branches are mostly not taken
- Main idea: move infrequent code farther away so that the normal flow remains a straight-line sequence
- Edge weights are profiled, not just basic block counts

Slide 18: Pettis & Hansen – Prototype 2
- Benefits:
  - Longer sequences of code execute before a branch is taken
  - More useful instructions per cache line
  - Fewer cache misses; denser instruction stream
  - Reduced branch misprediction
  - Better use of branch delay slots
  - Special case: an if-then-else with a seldom-executed else clause (think exception handlers); moving the infrequently executed code eliminates an unconditional branch

Slide 19: Basic Block Reordering Algorithm
- Algorithm 1: bottom-up approach
- Objective: create chains of basic blocks
- Input: a graph of basic blocks whose edge weights are profile counts
- Initially, each basic block is both the head and the tail of its own chain
- Two chains are merged when an arc connects the tail of one to the head of the other

Slide 20: Basic Block Reordering Algorithm
- If the source of the arc is not a tail, or the target of the arc is not a head, the chains cannot be merged
- If the more frequently executed arc out of a conditional branch merges two chains (as it usually does; hot code flows to hot code), then when the less frequently executed arc is considered no merger is possible, because the arc's source (the conditional branch) is already in the middle of a chain
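
Here is a compact Python sketch of this bottom-up pass, again an illustrative reconstruction rather than the original compiler code: arcs are visited in decreasing weight order, and two chains merge only when the arc runs from the tail of one chain to the head of the other.

```python
def build_chains(arcs):
    """arcs: {(src, dst): weight} over directed CFG edges.
    Returns the list of chains (each a list of block names)."""
    chain_of = {}                      # block -> its current chain
    for (s, d) in arcs:
        for b in (s, d):
            chain_of.setdefault(b, [b])
    for (s, d), _ in sorted(arcs.items(), key=lambda kv: -kv[1]):
        c1, c2 = chain_of[s], chain_of[d]
        # Merge only if s is the tail of its chain and d the head of another.
        if c1 is not c2 and c1[-1] == s and c2[0] == d:
            c1.extend(c2)
            for b in c2:
                chain_of[b] = c1
    seen, chains = set(), []           # deduplicate by chain identity
    for c in chain_of.values():
        if id(c) not in seen:
            seen.add(id(c))
            chains.append(c)
    return chains
```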

Slide 21: Basic Block Reordering Algorithm Example
- Step 1: B-C is the highest-weight edge
  - B-C
- Step 2: C-D added
  - B-C-D
- Step 3: N-B, D-F, and E-N added
  - E-N-B-C-D-F
- Step 4: D-E discarded
  - D, the source, is no longer a tail, and E is not a head

Slide 22: Basic Block Reordering Algorithm Example
- Step 5: F-H added
  - E-N-B-C-D-F-H
- Step 6: H-N and F-I discarded
  - N is no longer a head, and F is no longer a tail
- Step 7: I-J starts a new chain
  - E-N-B-C-D-F-H
  - I-J
- Continue in this manner until every edge has been considered and either added to a chain or discarded

Slide 23: Basic Block Reordering Algorithm Example
- Final set of chains:
  - A
  - E-N-B-C-D-F-H
  - I-J-L
  - G-O
  - K
  - M
- The chains must be ordered so that not-taken conditional branches become forward branches
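
The chain builder above reproduces this final set on a reconstruction of the slides' CFG. The edge weights below are hypothetical; only their relative order matters, and it is chosen to match the order in which the slides consider the arcs.

```python
cfg = {("B", "C"): 100, ("C", "D"): 90, ("N", "B"): 80, ("D", "F"): 70,
       ("E", "N"): 60, ("D", "E"): 50, ("F", "H"): 40, ("H", "N"): 30,
       ("F", "I"): 25, ("I", "J"): 20, ("J", "L"): 15, ("G", "O"): 10,
       ("A", "B"): 5, ("J", "K"): 4, ("I", "M"): 3, ("B", "O"): 2,
       ("C", "G"): 1}
for chain in build_chains(cfg):
    print("-".join(chain))
# E-N-B-C-D-F-H, I-J-L, G-O, plus the singletons A, K, M
```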

Slide 24: Basic Block Reordering Algorithm Example
- Six conditional branches:
  - At B: chain 2 (B) before chain 4 (O)
  - At C: chain 2 (C) before chain 4 (G)
  - At F: chain 2 (F) before chain 3 (I)
  - At I: chain 3 (I) before chain 6 (M)
  - At J: chain 3 (J) before chain 5 (K)
  - At D: the link D-E should be forward, but since both blocks are in the same chain, no reordering is needed
- Final order of chains: A, E-N-B-C-D-F-H, I-J-L, G-O, K, M

Slide 25: Basic Block Reordering Algorithm
- Algorithm 2: top-down approach
- Objective: create chains of basic blocks
- Input: a graph of basic blocks whose edge weights are profile counts
- A simple depth-first approach
- Begin at a node, place the successor with the highest edge weight immediately after it, and continue until the successor has already been placed; then restart at an unselected node and continue until all nodes have been visited once

Slide 26: Basic Block Reordering Algorithm Example
- Begin at A
- A-B-C-D-F-H-N
- Continue at E (the target of the highest-weight edge seen but not yet selected)
- Continue at I, and so on
- Final ordering: A-B-C-D-F-H-N-E-I-J-L-O-K-G-M
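
And a sketch of the top-down pass, reusing the hypothetical cfg weights from the chain-building example. Since the true weights on the slides are unknown, only the hot prefix of the result is expected to match the slide.

```python
def top_down_layout(arcs, entry):
    """Place blocks depth-first along the heaviest arcs."""
    succs = {}
    for (s, d), w in arcs.items():
        succs.setdefault(s, []).append((w, d))
    placed, order = set(), []
    pending = [(0, entry)]            # (arc weight seen, block)
    while pending:
        pending.sort(reverse=True)    # resume at the heaviest unplaced block
        _, b = pending.pop(0)
        while b is not None and b not in placed:
            placed.add(b)
            order.append(b)
            nxt = None
            for w, d in sorted(succs.get(b, []), reverse=True):
                if d not in placed:
                    if nxt is None:
                        nxt = d                   # heaviest successor goes next
                    else:
                        pending.append((w, d))    # remember for later
            b = nxt
        pending = [(w, d) for (w, d) in pending if d not in placed]
    for b in sorted({x for e in arcs for x in e}):
        if b not in placed:           # blocks never reached from the entry
            order.append(b)
    return order

print("-".join(top_down_layout(cfg, "A")))
# A-B-C-D-F-H-N-E-I-J-L-... : the hot prefix matches the slide; the order
# of the cold tail blocks (K, M, O, G) depends on the assumed weights.
```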

Slide 27: Top-Down vs. Bottom-Up Approach
- Order of basic blocks:
  - Bottom-up: A-E-N-B-C-D-F-H-I-J-L-G-O-K-M
  - Top-down: A-B-C-D-F-H-N-E-I-J-L-O-K-G-M
- Both approaches convert conditional branches into forward branches (usually not taken)
- Advantage of bottom-up over top-down:
  - Better at removing unconditional branches

Slide 28: Procedure Splitting
- Frequently executed basic blocks are located toward the top of the procedure (the primary portion)
- Infrequent code is located at the bottom: the "fluff"
- Procedure splitting is the process of separating the fluff from the primary portion
- Stubs with long branches are added to reach the relocated fluff region
- A new procedure is created from the fluff blocks
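
As a minimal sketch (with a hypothetical block representation), splitting just partitions a procedure's blocks by execution count; in the real transformation the compiler also rewrites branches to reach the fluff through long-branch stubs.

```python
def split_procedure(blocks, counts, threshold=1):
    """blocks: ordered list of block names; counts: {block: exec count}.
    Returns (primary, fluff) layouts, preserving relative order."""
    primary = [b for b in blocks if counts.get(b, 0) >= threshold]
    fluff = [b for b in blocks if counts.get(b, 0) < threshold]
    return primary, fluff

hot, cold = split_procedure(
    ["entry", "check", "error_path", "loop", "exit"],
    {"entry": 500, "check": 500, "error_path": 0, "loop": 10000, "exit": 500})
print(hot)   # ['entry', 'check', 'loop', 'exit']  -> primary portion
print(cold)  # ['error_path']                      -> relocated fluff
```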

Slide 29: Procedure Splitting
- Reduces the number of pages required for hot code from 2 to 1

Slide 30: Experimental Findings
- PP = procedure positioning; BBP = basic block positioning; PS = procedure splitting
- Figure: performance breakdown for a small application (Othello)

Slide 31: Procedure Positioning Experimental Data
- The number of static long branches increased
- However, the number of executed long branches dropped drastically

Slide 32: Procedure Splitting Experimental Data
- The overhead of creating stubs is very low
- Measured as the ratio of fluff instructions moved to long branches inserted

Slide 33: Inferences
- Side effect of basic block positioning:
  - Reduction in the number of executed instructions
  - Straightening if-then sequences eliminated wasted delay slots (counted as wasted instructions)
  - Removing infrequent basic blocks removes the unconditional jump needed to branch around them
- BBP alone achieved 2-3 percent better performance simply by reducing the number of executed instructions
- The average number of instructions executed before taking a branch increased from 6.19 to 8.09, a 31% improvement
- The number of executed penalty branches was reduced by 42%

Slide 34: Drawbacks
- Two-pass compilation
- Debugging positioned code is harder
- No reuse of data:
  - Profiling information must be recollected
- Representative inputs:
  - Results depend on the input used during the profiling phase
- Fluff block size is not considered

Slide 35: Procedure vs. Basic Block Level

Procedure Positioning (PP)                   | Basic Block Positioning (BBP)
---------------------------------------------|-------------------------------------------------
Linker level                                 | Compiler level
Larger granularity                           | Finer granularity
Hot and cold code sections are interspersed  | Hot code is grouped at the top of the procedure
Lower benefits compared to BBP               | Allows procedure splitting
Call graphs require less space (fewer nodes) | Massive control-flow graphs, many basic blocks

Slide 36: Example 1 – Online Feedback-Directed Optimization of Java
- Online vs. offline profiling
- Online strategies
- Code reordering results

Slide 37: Online Profiling vs. Offline Profiling

Online profiling                                  | Offline profiling
--------------------------------------------------|--------------------------------------------------
Difficult to implement                            | Easier to implement
High runtime cost                                 | No runtime cost (profiling is done on a prior run)
Optimizes within methods                          | Optimizes the complete program
Optimizations apply right away                    | Optimizations apply on the next run
Profiling information may not always be available | Profiling information is available prior to execution
Can capture phase behavior of the application     | Captures average behavior of the application
Lower accuracy                                    | Better accuracy

Slide 38: Online Profiling Strategies
1. Profile early, during unoptimized execution
  - Profile during the interpretation stage
  - Once hot code is detected, optimize it into the code cache
  - No profiling or instrumentation while executing optimized code
- Advantages:
  - Low profiling overhead (the degradation is hidden by interpreter latency)
  - Profiling information is available early, enabling earlier FDO
- Disadvantages:
  - The profile of optimized code is not always accurate
  - Optimization might be ineffective or counterproductive

Slide 39: Online Profiling Strategies
2. Profile optimized code
  - Profile after interpretation, during the lightweight optimization phase (identifying this phase is difficult)
  - Code patching: remove instrumentation from the optimized code after 3-4 runs of it
  - Profile in short bursts to avoid overhead
- Caveats:
  - Recording complete information in the system is not possible
  - The information may not be representative of program behavior
  - May introduce architecture-specific complexities, such as maintaining cache coherency
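
A toy sketch of the code-patching idea: a probe counts executions and asks to be removed after a few runs. All names are hypothetical; a real JIT would patch the instrumentation out of the generated machine code.

```python
class BurstProfiler:
    """Counters that disable themselves after a few hits (code patching)."""
    def __init__(self, budget=4):
        self.counts = {}              # block -> observed executions
        self.budget = budget

    def probe(self, block):
        """Called from instrumented code. Returns False once the probe
        has fired `budget` times and should be patched out."""
        n = self.counts.get(block, 0) + 1
        self.counts[block] = n
        return n < self.budget        # the JIT removes the probe on False

prof = BurstProfiler(budget=4)
while prof.probe("loop_header"):
    pass                              # simulate re-entering optimized code
print(prof.counts)                    # {'loop_header': 4}
```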

Slide 40: Online Feedback-Directed Optimization of Java
- Code reordering:
  - Based on a static heuristic, mark blocks as cold (e.g., exception handlers)
  - Move cold blocks to the bottom of the procedure
  - Top-down approach of Pettis & Hansen
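
A minimal sketch of that heuristic, where the block tuples and the handler flag are hypothetical stand-ins for the JIT's internal representation: cold blocks sink to the bottom while the hot order is preserved.

```python
def reorder_blocks(blocks):
    """blocks: list of (name, is_handler) in original layout order.
    Cold (handler) blocks move to the bottom; hot order is preserved."""
    hot = [b for b in blocks if not b[1]]
    cold = [b for b in blocks if b[1]]
    return hot + cold

layout = [("entry", False), ("catch_npe", True), ("loop", False),
          ("catch_io", True), ("ret", False)]
print([name for name, _ in reorder_blocks(layout)])
# ['entry', 'loop', 'ret', 'catch_npe', 'catch_io']
```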

Slide 41: Improvement Due to Code Reordering
(Source: Online feedback directed optimization of Java)

Slide 42: Improvement Due to Code Reordering
(Source: Online feedback directed optimization of Java)

Slide 43: Example 2 – The StarJIT Compiler: A Dynamic Compiler for Managed Runtime Environments
- Overview
- Tail duplication
- Results

Slide 44: The StarJIT Compiler: A Dynamic Compiler for Managed Runtime Environments
- Specifically targets Intel architecture (Itanium Processor Family)
- Single compilation
  - Similar to a static compiler
- Transforms Java bytecode or CLI bytecode to native code
- Aggressive optimization using static heuristics
- Online profile-guided optimization
- Global optimizer: adapts as profile intensity changes over time

Slide 45
(Source: The StarJIT Compiler: A Dynamic Compiler for Managed Runtime Environments, Intel)

Slide 46
(Source: The StarJIT Compiler: A Dynamic Compiler for Managed Runtime Environments, Intel)

Slide 47: Optimizations Performed
- Trace picker: picks the longer trace (the main trace)
- Tail duplication: eliminates cold side entries
- Method splitting (procedure splitting):
  - Partitions compiled code into hot and cold sections

Slide 48: Tail Duplication
- Tail duplication creates superblocks by first identifying traces and then eliminating side entries into a trace
- It creates a separate off-trace copy of the basic blocks between a side entrance and the trace exit, and redirects the edge corresponding to the side entry to the copy

Slide 49: Tail Duplication Example

    A
    if (condition) {
        B
    } else {
        C
    }
    D
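
A sketch of what tail duplication does to this example: with the hot trace A -> B -> D, the cold side entry C -> D is redirected to a fresh copy D'. The edge-list encoding is hypothetical, and for brevity only the edge redirection is shown; the real transformation also emits the duplicated blocks from the side entrance to the trace exit.

```python
def tail_duplicate(edges, trace):
    """edges: list of (src, dst); trace: ordered hot blocks.
    Redirect side entries into the trace toward duplicate blocks."""
    on_trace = set(trace)
    out = []
    for s, d in edges:
        # A side entrance: an off-trace block jumping into the middle of
        # the trace is redirected to a duplicate of the tail blocks.
        if d in on_trace and d != trace[0] and s not in on_trace:
            out.append((s, d + "'"))
        else:
            out.append((s, d))
    return out

cfg = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
print(tail_duplicate(cfg, ["A", "B", "D"]))
# [('A', 'B'), ('A', 'C'), ('B', 'D'), ('C', "D'")]
```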

Slide 50: Tradeoffs of Tail Duplication
- Benefits:
  - Only the branch target changes; neither the copied nor the original code needs further modification, so the overhead is small
  - Superblocks yield high-performance code for the hot trace
- Disadvantage:
  - Code growth

Slide 51: Tradeoffs
- Trace scheduling and selection come after code layout, to schedule the unconditional branches
- But tail duplication must be done after the code layout is finalized
- Yet tail duplication decisions are best made with input from trace formation
- Pipeline: Trace Picking -> Code Layout -> Tail Duplication

Slide 52: Results
- Average 2.03% speedup
- Average 1.51% code growth
(Source: Comparing tail duplication with compensation code in single path global instruction ...)

Slide 53: Conclusions
- Code layout optimizations improve instruction locality
- The main idea is to bring hot code closer together on the same page, reducing long branches and i-cache misses
- Other benefits:
  - Longer sequences of code execute before a branch is taken
  - More useful instructions per cache line
  - Fewer cache misses; denser instruction stream
  - Reduced branch misprediction
  - Better use of branch delay slots
- Procedure splitting, procedure positioning, and basic block reordering are the main techniques used
- Both top-down and bottom-up approaches exist

Slide 54: Conclusions
- Feedback-driven optimizations require profiling information, which can be gathered
  - Online
  - Offline
- Tradeoffs among profiling techniques:
  - Increase in runtime
  - Two passes
  - Representativeness of the information gathered
  - Phase behavior of the application
  - Ease of implementation
  - Overhead of instrumentation and sampling

