Just-In-Time Java Compilation for the Itanium Processor
Tatiana Shpeisman, Guei-Yuan Lueh, Ali-Reza Adl-Tabatabai
Intel Labs
2 Introduction
- The Itanium processor is a statically scheduled machine
  - Aggressive compiler techniques are needed to extract ILP
- A Just-In-Time (JIT) compiler must be fast
  - Must consider time and space efficiency of optimizations
- Balance compilation time with code quality
  - Light-weight compilation techniques
  - Use heuristics for modeling the microarchitecture
  - Leverage semantics and metadata of the JVM
3 Outline
- Introduction
- Compiler overview
- Register allocation
- Code scheduling
- Other optimizations
- Conclusions
4 Compiler Structure
[Figure: compiler pipeline diagram split into a Front-end and a Back-end. Phases shown: Prepass, IR construction, Inlining, Global optimizations, Code Selection, Predication, Register Allocation, Code Scheduling, Code Emission, GC Support.]
5 Register Allocation
- Compilation time vs. code quality tradeoff
- IPF architecture has large register files
  - 128 integer, 128 floating-point, 64 predicate, 8 branch
  - Register Stack Engine (RSE) provides 96 stack registers to each procedure
- Use linear scan register allocation
  - "Linear Scan Register Allocation" by Massimiliano Poletto and Vivek Sarkar
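The linear scan idea named on this slide can be sketched as follows. This is a minimal illustration in the spirit of Poletto and Sarkar's algorithm, not the allocator described in the talk; the `Interval` class, the `linear_scan` function, and the sample intervals are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Interval:
    name: str
    start: int  # position of the first definition in the linear order
    end: int    # position of the last use in the linear order


def linear_scan(intervals: List[Interval], num_regs: int) -> Dict[str, Optional[int]]:
    """Map each variable to a register number, or to None if it is spilled."""
    if num_regs < 1:
        raise ValueError("need at least one register")
    assignment: Dict[str, Optional[int]] = {}
    free = list(range(num_regs))
    active: List[Interval] = []  # intervals holding a register, sorted by end

    for cur in sorted(intervals, key=lambda iv: iv.start):
        # Expire intervals that ended before the current one starts.
        while active and active[0].end < cur.start:
            free.append(assignment[active.pop(0).name])

        if free:
            assignment[cur.name] = free.pop(0)
            active.append(cur)
        else:
            # No register left: spill whichever interval ends last.
            victim = active[-1]
            if victim.end > cur.end:
                assignment[cur.name] = assignment[victim.name]
                assignment[victim.name] = None
                active[-1] = cur
            else:
                assignment[cur.name] = None
        active.sort(key=lambda iv: iv.end)
    return assignment


# Hypothetical intervals, allocated with two registers.
example = linear_scan(
    [Interval("a", 0, 5), Interval("b", 1, 3), Interval("c", 2, 8), Interval("d", 4, 6)],
    num_regs=2,
)
```

Each interval is visited once in order of its start point, which is what keeps the allocator cheap enough for a JIT compared with graph coloring.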
6 Live Range vs. Live Interval
[Figure: a control-flow graph with blocks B1-B4, drawn twice to contrast live ranges with live intervals. The blocks contain definitions of t1 and t2, the copies v = t1 and v = t2, and a later use of v.]
7 Coalescing Algorithm
- Coalesce v and t in v = t iff
  - Live interval of t ends at v = t
  - Live interval of t does not intersect with live range of v
- Requires one additional reverse pass over IR
  - O(N_INST + N_VAR * N_BB)
[Figure: control-flow graph with blocks B1-B4 containing definitions of t1 and t2, the copies v = t1 and v = t2, and a use of v.]
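The two coalescing conditions can be sketched as a small predicate. The representation is an assumption made for illustration: instructions are numbered in a linear order, the live interval of t is a (definition, last use) pair, and the live range of v is the set of positions after which v is live. The function name and the six-instruction layout are hypothetical.

```python
from typing import Set, Tuple


def can_coalesce(copy_pos: int, t_interval: Tuple[int, int], v_live_after: Set[int]) -> bool:
    """Decide whether the copy `v = t` at copy_pos may be coalesced.

    t_interval   -- (definition position, last-use position) of t
    v_live_after -- positions p such that v is live just after instruction p
    """
    t_start, t_end = t_interval
    if t_end != copy_pos:  # condition 1: the interval of t ends at the copy
        return False
    # t is live just after every instruction in [t_start, t_end).
    # Condition 2: none of those points may lie in the live range of v.
    return all(p not in v_live_after for p in range(t_start, t_end))


# Hypothetical layout with two alternative paths that join at the use of v:
#   0: t1 = ...   1: v = t1   2: jump to 5
#   3: t2 = ...   4: v = t2   5: ... = v
V_LIVE_RANGE = {1, 2, 4}             # v is dead between 2 and 4 (redefined at 4)
V_LIVE_INTERVAL = set(range(1, 5))   # first definition to last use, no holes

with_range = can_coalesce(4, (3, 4), V_LIVE_RANGE)
with_interval = can_coalesce(4, (3, 4), V_LIVE_INTERVAL)
```

Testing t2 against the live range of v permits the coalescing, while testing against the coarser live interval of v would reject it, which is why the slide's second condition is phrased in terms of the live range.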
8 Coalescing Speedup
9 Code Scheduling
- Forward cycle-based list scheduling
- Scheduling unit is extended basic block
  - Middle exits are due to run-time exceptions

    (p6,p7) = cmp.eq r35, 0
    (p6) br ThrowNullPointerException
    r10 = r
    r11 = ld8 [r10]
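A forward cycle-based list scheduler can be sketched in a few lines. This is a simplified illustration, not the scheduler from the talk: it uses original program order as the priority, models only a fixed issue width, and assumes the dependence graph is acyclic and every predecessor is in the instruction list. The instruction names and latencies in the example are hypothetical.

```python
from typing import Dict, List, Tuple


def list_schedule(instrs: List[str],
                  deps: Dict[str, List[Tuple[str, int]]],
                  issue_width: int) -> Dict[str, int]:
    """Return the issue cycle chosen for each instruction.

    instrs -- instruction names in original program order
    deps   -- deps[i] lists (predecessor, latency) pairs for instruction i
    """
    if issue_width < 1:
        raise ValueError("issue width must be positive")
    cycle_of: Dict[str, int] = {}
    remaining = list(instrs)
    cycle = 0
    while remaining:
        issued = 0
        for ins in list(remaining):
            if issued == issue_width:
                break
            # Ready when every predecessor has issued and its latency has elapsed.
            ready = all(p in cycle_of and cycle_of[p] + lat <= cycle
                        for p, lat in deps.get(ins, []))
            if ready:
                cycle_of[ins] = cycle
                remaining.remove(ins)
                issued += 1
        cycle += 1  # advance to the next cycle
    return cycle_of


# Hypothetical dependence graph: br waits on cmp, ld waits on add.
schedule = list_schedule(
    ["cmp", "br", "add", "ld"],
    {"br": [("cmp", 1)], "ld": [("add", 1)]},
    issue_width=2,
)
```

Here the independent `add` moves up next to `cmp` in cycle 0, and `br` and `ld` follow in cycle 1.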
10 Type-based memory disambiguation
- Use JVM metadata to disambiguate memory locations
  - Type: integer, floating-point, object reference …
  - Kind: object field, array element, virtual table address …
  - Field id: putfield #10 vs. putfield #15
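The three attributes listed above suggest a simple alias test. The sketch below is an assumption about how such a test could look, not the compiler's actual implementation; `MemRef` and `may_alias` are hypothetical names, and it assumes that distinct field ids denote distinct fields.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MemRef:
    type: str                       # e.g. "int", "float", "ref"
    kind: str                       # e.g. "field", "array", "vtable"
    field_id: Optional[int] = None  # only meaningful when kind == "field"


def may_alias(a: MemRef, b: MemRef) -> bool:
    """Conservatively answer whether two memory references can overlap."""
    if a.type != b.type:
        return False  # an int location never overlaps a float location
    if a.kind != b.kind:
        return False  # an object field never overlaps an array element
    if a.kind == "field" and a.field_id != b.field_id:
        return False  # assumed: different field ids name different fields
    return True       # otherwise assume the references may alias


store_10 = MemRef("int", "field", 10)
store_15 = MemRef("int", "field", 15)
independent = not may_alias(store_10, store_15)
```

When the test returns False, the scheduler does not need a memory dependence edge between the two references and can reorder them.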
11 Type-Based Disambiguation
12 Exception Dependencies
- Java exceptions are precise
- Naive approach
  - Exception checks end basic blocks
- Our approach
  - An instruction depends on an exception check iff
    - Its destination is live at the exception handler, or
    - It is an exception check for a different exception type, or
    - It is a memory reference that may be guarded by the check
13 Exception Dependency Example

    1: (p6, p0) = cmp.eq r16, 0
    2: (p6) br ThrowNullPointerException
    3: r17 = add r16, 8
    4: r18 = ld [r17]             // load field
    5: r21 = movl 0x000F14E
    6: f8 = fld [r21]             // load static
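The three dependence rules can be sketched as a predicate and applied to instructions 3 to 6 of the example. The data model is an assumption made for illustration: `Check`, `Instr`, and `depends_on_check` are hypothetical names, the "may be guarded" test is reduced to comparing the object register an access goes through, and no register is assumed live at the handler.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Check:
    exc_type: str                    # exception the check may throw
    guarded_object: str              # register holding the reference being tested
    live_at_handler: FrozenSet[str]  # registers live at the exception handler


@dataclass(frozen=True)
class Instr:
    dest: Optional[str] = None
    check_type: Optional[str] = None  # set if the instruction is itself a check
    via_object: Optional[str] = None  # object register a memory access goes through


def depends_on_check(instr: Instr, check: Check) -> bool:
    if instr.dest is not None and instr.dest in check.live_at_handler:
        return True   # rule 1: destination is live at the handler
    if instr.check_type is not None and instr.check_type != check.exc_type:
        return True   # rule 2: a check for a different exception type
    if instr.via_object is not None and instr.via_object == check.guarded_object:
        return True   # rule 3: memory reference guarded by the check
    return False


# Null check on r16 (instructions 1-2); assumed: nothing is live at the handler.
null_check = Check("NullPointerException", "r16", frozenset())
add_3 = Instr(dest="r17")                      # 3: r17 = add r16, 8
load_field_4 = Instr(dest="r18", via_object="r16")  # 4: r18 = ld [r17]
movl_5 = Instr(dest="r21")                     # 5: r21 = movl ...
load_static_6 = Instr(dest="f8")               # 6: f8 = fld [r21], no object involved
movable = [not depends_on_check(i, null_check)
           for i in (add_3, load_field_4, movl_5, load_static_6)]
```

Under these assumptions only the field load has to stay below the null check; the address computation, the `movl`, and the static load are free to move above it.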
14 Exception Dependencies
15 IPF Architecture
- Execution (functional) unit types: M, I, F, B
- Instruction (syllable) types: M, A, I, F, B, IL
- Bundles, templates
  - .mii .mi;;i .mil .mmi .m;;mi .mfi .mmf .mib .mbb .bbb .mmb .mfb
- Instruction group: no RAW, WAW (with some exceptions)

    .mi;;i
    r10 = ld [r15]
    r9 = add r8, 1 ;;             // stop bit
    r16 = shr r9, r32
16 Template Selection
- Pack instructions into bundles
  - Choose slot for each instruction
  - Insert NOP instructions
- Assign instructions to functional units
- Problems:
  - Resource oversubscription
  - Inaccurate bypass latencies
17 Greedy slot assignment
- Sort instructions by syllable type: M < F < IL < I < A < B

    I1: r20 = sxt r14             (I-type)
    I2: r21 = movl ADDR           (IL-type)
    I3: f15 = fadd f10, f11       (F-type)

- Slot assignment produced by the algorithm:
  - Unsorted: NOP I1 NOP I2 NOP I3 NOP
  - Sorted: NOP I3 I1 NOP I2
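The effect of sorting can be reproduced with a small greedy packer. This is a sketch under stated assumptions rather than the algorithm from the talk: every instruction goes into the first free slot of the first bundle that still matches some template, all instructions are treated as independent members of one instruction group, and stop bits are ignored. All names are hypothetical, and the template list uses "L" for the two slots an IL instruction occupies.

```python
from typing import List, Optional, Tuple

Instr = Tuple[str, str]         # (name, syllable type)
Bundle = List[Optional[Instr]]  # three slots, None = still empty

# Slot kinds per template; "L" marks the two slots taken by one IL instruction.
TEMPLATES = ["MII", "MLL", "MMI", "MFI", "MMF", "MIB", "MBB", "BBB", "MMB", "MFB"]
# Slot kinds each syllable type may occupy (A-type fits M or I slots).
ALLOWED = {"M": "M", "I": "I", "F": "F", "B": "B", "A": "MI", "IL": "L"}
SORT_ORDER = ["M", "F", "IL", "I", "A", "B"]


def has_template(bundle: Bundle) -> bool:
    """True if some template agrees with every filled slot."""
    return any(all(cell is None or kind in ALLOWED[cell[1]]
                   for cell, kind in zip(bundle, tmpl))
               for tmpl in TEMPLATES)


def try_place(bundle: Bundle, instr: Instr) -> bool:
    width = 2 if instr[1] == "IL" else 1
    for k in range(3 - width + 1):
        if any(bundle[k + d] is not None for d in range(width)):
            continue
        trial = list(bundle)
        for d in range(width):
            trial[k + d] = instr
        if has_template(trial):
            bundle[:] = trial
            return True
    return False


def greedy_assign(instrs: List[Instr]) -> List[Bundle]:
    bundles: List[Bundle] = []
    for instr in instrs:
        if not any(try_place(b, instr) for b in bundles):
            fresh: Bundle = [None, None, None]
            try_place(fresh, instr)  # every syllable type fits an empty bundle
            bundles.append(fresh)
    return bundles


def render(bundles: List[Bundle]) -> List[str]:
    lines = []
    for b in bundles:
        names = []
        for k, cell in enumerate(b):
            if cell is None:
                names.append("NOP")
            elif not (k > 0 and b[k - 1] is cell):  # skip second half of an IL
                names.append(cell[0])
        lines.append(" ".join(names))
    return lines


program = [("I1", "I"), ("I2", "IL"), ("I3", "F")]
unsorted_result = render(greedy_assign(program))
sorted_result = render(greedy_assign(
    sorted(program, key=lambda ins: SORT_ORDER.index(ins[1]))))
```

With the three instructions from the slide, the unsorted order needs three bundles, while the sorted order packs I3 and I1 into one bundle and needs only two.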
18 Template Selection Heuristics
19 Bypass Latency Accuracy
- Latency from r17 = add r16, 8 to r18 = ld [r17] (M-Unit) depends on the unit that executes the add
  - add on M-Unit: latency 1
  - add on I-Unit: latency 2
- Phase ordering of functional unit assignment
  - Code selection time is too early: underutilizes resources
  - Template selection time is too late: inaccurate scheduling latencies
- Solution: assign to functional unit during scheduling
  - Assign to M-Unit if available, else
  - Assign to I-Unit and increment latency
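The solution on this slide amounts to a small decision made inside the scheduler. The sketch below illustrates it for a single instruction that can run on either unit type; `assign_unit`, its parameters, and the default latency of 1 are hypothetical and taken only from the two latencies shown above.

```python
from typing import Tuple


def assign_unit(free_m_units: int, base_latency: int = 1) -> Tuple[str, int]:
    """Choose a functional unit for an instruction that can run on M or I.

    Returns (unit, latency seen by a dependent address use).
    """
    if free_m_units > 0:
        return "M", base_latency      # M-Unit available: no bypass penalty
    return "I", base_latency + 1      # fall back to I-Unit, one extra cycle


on_m = assign_unit(free_m_units=1)
on_i = assign_unit(free_m_units=0)
```

Because the choice is made while the schedule is being built, the latency used for the dependent load matches the unit the producer actually ends up on.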
20 Modeling of Address Computation Latency
21 Other optimizations
- Predication
  - Profitability depends on the benchmark
  - Performance variations within 2%
- Branch hints
  - Up to 50% speedup from using branch hints
- Sign-extension elimination
  - 1% potential gain for our compiler
22 Conclusions
- Light-weight optimization techniques for Itanium
- Considering the microarchitecture is important
  - Cannot ignore bypass latencies
  - Template selection should be resource sensitive
- Language semantics help to improve ILP
  - Type-based memory disambiguation
  - Exception dependency elimination