Compiling for IA-64
Carol Thompson, Optimization Architect, Hewlett Packard


1 Compiling for IA-64
Carol Thompson, Optimization Architect, Hewlett Packard

2 History of ILP Compilers
CISC era: no significant ILP
– Compiler is merely a tool to enable use of a high-level language, at some performance cost
RISC era: advent of ILP
– Compiler-influenced architecture
– Instruction scheduling becomes important
EPIC era: ILP as the driving force
– Compiler-specified ILP

3 Increasing Scope for ILP Compilation
Early RISC compilers
– Basic block scope (delimited by branches and branch targets)
Superscalar RISC and early VLIW compilers
– Trace scope (single entry, single path)
– Superblocks and hyperblocks (single entry, multiple paths)
EPIC compilers
– Composite regions: multiple entries, multiple paths
[Figure: nested scopes, from basic blocks through traces and superblocks to composite regions]

4 Unbalanced and Unbiased Control Flow
Most code is not well balanced
– Many very small blocks, some very large
– Then and else clauses are frequently unbalanced, in both number of instructions and path length
Many branches are highly biased
– But some are not
– The compiler can obtain frequency information from profiling, or derive it heuristically
[Figure: control-flow graphs annotated with branch frequencies such as 60/40 and 55/5]

5 Basic Blocks
Basic blocks are simple
– No issues with executing unnecessary instructions
– No speculation or predication support required
But, very limited ILP
– Short blocks offer very little opportunity for parallelism
– Long-latency code is unable to take advantage of issue bandwidth in an earlier block

6 Traces
Traces allow scheduling of multiple blocks together
– Increases available ILP
– Long-latency operations can be moved up, as long as they are on the same trace
But, unbiased branches are a problem
– Long-latency code on slightly less frequent paths can't move up
– Issue bandwidth may go unused (not enough concurrent instructions to fill available execution units)

7 Superblocks and Hyperblocks
Superblocks and hyperblocks allow inclusion of multiple important paths
– Long-latency code may migrate up from multiple paths
– Hyperblocks may be fully predicated
– More effective utilization of issue bandwidth
But, this requires code duplication
Wholesale predication may lengthen important paths

8 Composite Regions
Allow rejoin from non-region code
– Wholesale code duplication is not required
– Support full code motion across the region
– Allow all interesting paths to be scheduled concurrently
Nested, less important regions bear the burden of the rejoin
– Compensation code, as needed

9 Predication Approaches
Full predication of an entire region
– Penalizes short paths

10 On-Demand Predication
Predicate (and speculate) as needed
– Reduce the critical path(s)
– Fully utilize issue bandwidth
Retain control flow to accommodate unbalanced paths

11 Predicate Analysis
The instruction scheduler requires knowledge of predicate relationships
– For dependence analysis
– For code motion
Predicate Query System
– Graphical representation of predicate relationships
– Answers queries such as superset, subset, disjoint, …
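
The superset/subset/disjoint queries above can be sketched in a few lines. This is a hypothetical minimal model, not HP's actual implementation: each predicate records the predicate under which its defining compare executed (its parent), plus the complementary predicate produced by the same compare.

```python
# Minimal predicate query system sketch (hypothetical names).
class Predicate:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent      # predicate guarding the compare that defines this one
        self.complement = None    # sibling predicate produced by the same compare

def ancestors(p):
    """p itself plus every predicate that dominates it."""
    chain = []
    while p is not None:
        chain.append(p)
        p = p.parent
    return chain

def is_subset(p, q):
    """Whenever p is true, is q also true?"""
    return q in ancestors(p)

def is_disjoint(p, q):
    """Can p and q never be true at the same time?"""
    qa = ancestors(q)
    return any(a.complement is not None and a.complement in qa
               for a in ancestors(p))

# A compare executed under p0 produces the complementary pair p1, p2:
p0 = Predicate("p0")
p1 = Predicate("p1", parent=p0)
p2 = Predicate("p2", parent=p0)
p1.complement, p2.complement = p2, p1

print(is_subset(p1, p0))    # True: p1 implies p0
print(is_disjoint(p1, p2))  # True: at most one of p1, p2 holds
```

The scheduler would consult such queries before deciding, for example, that two predicated writes to the same register are independent.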

12 Predicate Computation
Compute all predicates possibly needed
Optimize
– To share predicates where possible
– To utilize parallel compares
– To fully utilize dual-target compares

13 Predication and Branch Counts
Predication reduces branch counts
– At both moderate and aggressive optimization levels

14 Predication & Branch Prediction
Misprediction rates are comparable with predication
– Despite significantly fewer branches
– Hence an increased mean time between mispredicted branches

15 Register Allocation
Modeled as a graph-coloring problem
– Nodes in the graph represent live ranges of variables
– Edges represent a temporal overlap of the live ranges
– Nodes sharing an edge must be assigned different colors (registers)
Example:
  x = ...
  y = ...
  ... = x
  z = ...
  ... = y
  ... = z
y overlaps both x and z, but x and z do not overlap: requires two colors
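
The coloring step can be sketched as a greedy algorithm. This is illustrative only; production allocators typically use Chaitin/Briggs-style simplify-and-spill rather than a single greedy pass.

```python
# Greedy graph coloring sketch: nodes are live ranges, edges are
# temporal overlaps, colors stand in for registers.
def color_graph(nodes, edges):
    """Give each node the lowest color not used by any colored neighbor."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    colors = {}
    for n in nodes:
        taken = {colors[m] for m in adj[n] if m in colors}
        c = 0
        while c in taken:
            c += 1
        colors[n] = c
    return colors

# The slide's example: y overlaps both x and z, but x and z do not
# overlap each other, so two colors (registers) suffice.
print(color_graph(["x", "y", "z"], [("x", "y"), ("y", "z")]))
# -> {'x': 0, 'y': 1, 'z': 0}
```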

16 Register Allocation, with Control Flow
The same variables, now split across a branch:
  y = ...
  then-path: x = ... ; ... = x
  else-path: z = ... ; ... = z
  after the rejoin: ... = y
x and z lie on disjoint paths and never overlap: still requires two colors

17 Register Allocation, with Predication
The same code, if-converted: the branch is gone, and x = ... and z = ... now sit in one straight-line block under predicates
Without predicate information, the live ranges of x, y, and z all appear to overlap: now requires three colors

18 Predicate Analysis
In the predicated code, x is guarded by p1 and z by p2, both computed by the same compare under p0
p1 and p2 are disjoint: if p1 is TRUE, p2 is false, and vice versa

19 Register Allocation, with Predicate Analysis
Since p1 and p2 are disjoint, the live ranges of x (under p1) and z (under p2) do not interfere
Now back to two colors
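
How predicate analysis relaxes interference can be sketched as follows (hypothetical names and live intervals; the real allocator would consult the predicate query system): two live ranges interfere only if their intervals overlap in time AND their guarding predicates can be true simultaneously.

```python
# Predicate-aware interference sketch. Disjointness is assumed known
# from predicate analysis: p1 and p2 come from the same compare, so
# at most one of them is true at a time.
DISJOINT = {frozenset(("p1", "p2"))}

def interferes(a, b):
    """a, b: (guarding predicate, (start, end)) live ranges."""
    (pa, (sa, ea)), (pb, (sb, eb)) = a, b
    overlap = sa < eb and sb < ea
    return overlap and frozenset((pa, pb)) not in DISJOINT

x = ("p1", (2, 6))   # x defined and used under p1
y = ("p0", (0, 7))   # y live across the whole predicated block
z = ("p2", (3, 5))   # z defined and used under p2
print(interferes(x, y))  # True:  intervals overlap, p0 and p1 can both be true
print(interferes(x, z))  # False: p1 and p2 are disjoint, so x and z can share a register
```

Dropping the x–z edge is exactly what brings the slide's example back from three colors to two.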

20 Effect of Predicate-Aware Register Allocation
Reduces register requirements for individual procedures by 0% to 75%
– Depends upon how aggressively predication is applied
Average dynamic reduction in register stack allocation for gcc is 4.7%

21 Object-Oriented Code
Challenges
– Small procedures, many of them indirect (virtual) calls
  – Limits the size of regions, and the scope for ILP
– Exception handling
– Bounds checking (Java)
  – Inherently serial: must check before executing a load or store
Solutions
– Inlining for non-virtual functions or provably unique virtual functions
– Speculative inlining for the most common variant
– Liveness analysis of handlers
– Architectural support for speculation ensures recoverability
– Speculative execution guarantees correct exception behavior
– Dynamic optimization (e.g. Java): make use of dynamic profiles

22 Method Calls
A barrier between execution streams
Often, the location of the called method must be determined at runtime
– A costly "identity check" on the object must complete before the method may begin
– Even if the call nearly always goes to the same place
– Little ILP
[Figure: resolve target method, then call-dependent code, with several possible targets]

23 Speculating Across Method Calls
The compiler predicts the target method
– Using profiling
– Using the current state of the class hierarchy
The predicted method is inlined
– Fully or partially
Speculative execution of the called method begins while the actual target is determined

24 Speculation Across Method Calls
[Figure: the dominant called method is inlined and begins speculatively while the target method is resolved; the other target methods are called only if needed]

25 Bounds & Null Checks
Checks inhibit code motion
Null checks:
  x = y.foo;
becomes
  if (y == null) throw NullPointerException;
  x = y.foo;
Bounds checks:
  x = a[i];
becomes
  if (a == null) throw NullPointerException;
  if (i < 0 || i >= a.length) throw ArrayIndexOutOfBoundsException;
  x = a[i];

26 Speculating Across Bounds Checks
Bounds checks rarely fail
  x = a[i];
becomes
  ld.s t = a[i];
  if (a == null) throw NullPointerException;
  if (i < 0 || i >= a.length) throw ArrayIndexOutOfBoundsException;
  chk.s t;
  x = t;
The long-latency load can begin before the checks

27 Exception Handling
Exception handling inhibits motion of subsequent code
  if (y.foo) throw MyException;
  x = y.bar + z.baz;

28 Speculation in the Presence of Exception Handling
Execution of subsequent instructions may begin before the exception is resolved
  if (y.foo) throw MyException;
  x = y.bar + z.baz;
becomes
  ld    t1 = y.foo
  ld.s  t2 = y.bar
  ld.s  t3 = z.baz
  add   x = t2 + t3
  if (t1) throw MyException;
  chk.s x
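
The ld.s/chk.s mechanism used above can be modeled in a few lines. This is a conceptual sketch, not the real hardware semantics: a speculative load that would fault returns a NaT ("not a thing") token instead of raising, NaT propagates through computation, and the check raises or recovers only if the speculative result is actually consumed.

```python
# Conceptual model of IA-64 control speculation (illustrative only).
NAT = object()  # deferred-exception token

def ld_s(memory, addr):
    """Speculative load: return NaT instead of faulting on a bad address."""
    return memory.get(addr, NAT)

def add_s(a, b):
    """NaT propagates through speculative computation."""
    return NAT if a is NAT or b is NAT else a + b

def chk_s(value, recovery):
    """If the speculative chain deferred a fault, branch to recovery code."""
    return recovery() if value is NAT else value

memory = {100: 7, 104: 35}
# Loads hoisted above the 'throw MyException' branch:
t2 = ld_s(memory, 100)
t3 = ld_s(memory, 104)
# The check re-executes the chain non-speculatively on failure:
x = chk_s(add_s(t2, t3),
          recovery=lambda: add_s(ld_s(memory, 100), ld_s(memory, 104)))
print(x)  # 42
```

If neither speculative load faulted, the check is free; only the rare faulting case pays for recovery, which is why hoisting past checks and throws wins on average.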

29 Dependence Graph for Instruction Scheduling
Source:
  if (n >= count) {
    (*log)++;
    return p->x[n];
  } else {
    return 0;
  }
Dependence-graph nodes:
       add t1 = 8,p
       ld4 count = [t1]
       cmp4.ge p1,p2 = n,count
  (p1) ld4 t3 = [log]
  (p1) add t2 = 1,t3
  (p1) st4 [log] = t2
  (p1) ld4 t3 = [p]
       shladd t4 = n,4,t3
  (p1) ld4 out0 = [t4]
       mov out0 = 0
       br.ret rp

30 Dependence Graph with Predication & Speculation
[Figure: the same dependence graph, with check nodes chk.a t4 and chk.a p added]
During dependence-graph construction, potentially control- and data-speculative edges and nodes are identified
Check nodes are added where possibly needed (note that only data-speculation checks are shown here)

31 Dependence Graph with Predication & Speculation
[Figure: the graph re-drawn with speculative edges relaxed; mov out0 = 0 is now guarded by (p2)]
Speculative edges may be violated; here the graph is re-drawn to show the enhanced parallelism
Note that speculating both writes to the out0 register would require insertion of a copy; the scheduler must consider this in its scheduling
Nodes with sufficient slack (e.g. the writes to out0) will not be speculated

32 Conclusions
IA-64 compilers push the complexity of the compiler
– However, the technology is a logical progression from today's
– Today's RISC compilers are more complex, more reliable, and deliver more performance than those of the early days
– The complexity trend is mirrored in both hardware and applications
– A balance is needed to maximize the benefits of each

