Just-In-Time Java Compilation for the Itanium Processor. Tatiana Shpeisman, Guei-Yuan Lueh, Ali-Reza Adl-Tabatabai. Intel Labs (www.intel.com/labs).

Presentation transcript:

Just-In-Time Java Compilation for the Itanium Processor
Tatiana Shpeisman, Guei-Yuan Lueh, Ali-Reza Adl-Tabatabai
Intel Labs

2 Introduction
• Itanium processor is a statically scheduled machine
  • Aggressive compiler techniques are needed to extract ILP
• Just-In-Time (JIT) compiler must be fast
  • Must consider time and space efficiency of optimizations
  • Balance compilation time with code quality
• Light-weight compilation techniques
  • Use heuristics for modeling the microarchitecture
  • Leverage semantics and metadata of the JVM

3 Outline
• Introduction
• Compiler overview
• Register allocation
• Code scheduling
• Other optimizations
• Conclusions

4 Compiler Structure
[Figure: compiler structure. Front-end: prepass, IR construction, inlining, global optimizations. Back-end: code selection, predication, code scheduling, register allocation, code emission, GC support.]

5 Register Allocation
• Compilation time vs. code quality tradeoff
• IPF architecture has large register files
  • 128 integer, 128 floating-point, 64 predicate, 8 branch
  • Register Stack Engine (RSE) provides 96 stack registers to each procedure
• Use linear scan register allocation
  • “Linear Scan Register Allocation” by Massimiliano Poletto and Vivek Sarkar
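The linear scan idea named on this slide can be sketched in a few lines. The following is a minimal illustration in the spirit of Poletto and Sarkar's algorithm, not the Intel JIT's implementation: the `Interval` class, the `linear_scan` function, and the spill rule (evict the active interval that ends last) are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Interval:
    name: str
    start: int                      # first position where the value is live
    end: int                        # last position where the value is live
    reg: Optional[int] = None       # None means "spilled or not yet assigned"


def linear_scan(intervals: List[Interval], num_regs: int) -> Dict[str, Optional[int]]:
    """Assign registers in one pass over intervals sorted by start point."""
    free = list(range(num_regs))
    active: List[Interval] = []     # intervals currently holding a register
    for cur in sorted(intervals, key=lambda i: i.start):
        # Expire intervals that ended before the current one starts.
        still_active = []
        for old in active:
            if old.end < cur.start:
                free.append(old.reg)
            else:
                still_active.append(old)
        active = still_active
        if free:
            cur.reg = free.pop(0)
            active.append(cur)
        elif active:
            # No free register: spill whichever interval ends last.
            victim = max(active, key=lambda i: i.end)
            if victim.end > cur.end:
                cur.reg, victim.reg = victim.reg, None
                active.remove(victim)
                active.append(cur)
    return {i.name: i.reg for i in intervals}


example = [Interval("a", 0, 10), Interval("b", 1, 3),
           Interval("c", 2, 8), Interval("d", 4, 6)]
assignment = linear_scan(example, num_regs=2)
```

With two registers, the long-lived interval `a` is the one spilled in this example, while `b`, `c`, and `d` all receive registers. The single sorted pass is what keeps compilation time low compared with graph coloring.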

6 Live Range vs. Live Interval
[Figure: example code over blocks B1, B2, B3, B4 containing the copies v = t1 and v = t2 and a later use of v. One panel marks the live ranges; the other marks the live intervals over the linear block order B1, B2, B4, B3.]

7 Coalescing Algorithm
• Coalesce v and t in v = t iff
  • Live interval of t ends at v = t
  • Live interval of t does not intersect with live range of v
• Requires one additional reverse pass over IR
• O(N_INST + N_VAR × N_BB)
[Figure: the example with copies v = t1 and v = t2 and a use of v, over the linear block order B1, B2, B4, B3.]
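The two coalescing conditions can be written as a small predicate. This is a sketch under stated assumptions: instruction positions are integers in the linear order, the live interval of `t` is one closed segment, the live range of `v` is a list of closed segments (so it may have holes), and the copy position itself is excluded from the overlap test because `t` is read and `v` is written there. The function name and the position numbering are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[int, int]   # closed [first, last] instruction positions


def can_coalesce(copy_pos: int, t_interval: Segment, v_range: List[Segment]) -> bool:
    """Check both conditions for coalescing v and t at the copy 'v = t'."""
    t_start, t_end = t_interval
    # Condition 1: the live interval of t ends at the copy.
    if t_end != copy_pos:
        return False
    # Condition 2: no segment of v's live range overlaps t's interval
    # strictly before the copy, i.e. within [t_start, copy_pos - 1].
    for seg_start, seg_end in v_range:
        if seg_start <= copy_pos - 1 and t_start <= seg_end:
            return False
    return True


# Hypothetical numbering: t1 is live over [1, 2] with the copy at 2,
# t2 is live over [4, 5] with the copy at 5, and v is live over
# [2, 3] and [5, 5] with a hole in between.
v_live_range = [(2, 3), (5, 5)]
ok_t1 = can_coalesce(2, (1, 2), v_live_range)
ok_t2 = can_coalesce(5, (4, 5), v_live_range)
# Using v's single live interval [2, 5] instead would block the second copy.
blocked = can_coalesce(5, (4, 5), [(2, 5)])
```

The last call illustrates why the test uses the live range of `v` and not its live interval: the interval covers the hole where `t2` is live and would report a false conflict.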

8 Coalescing Speedup

9 Code Scheduling
• Forward cycle-based list scheduling
• Scheduling unit is extended basic block
  • Middle exits are due to run-time exceptions

    (p6,p7) = cmp.eq r35, 0
    (p6) br ThrowNullPointerException
    r10 = r
    r11 = ld8 [r10]
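Forward cycle-based list scheduling can be sketched as follows. This is a generic illustration, not the scheduler described in the talk: it assumes an acyclic dependence graph, a single issue-width limit in place of Itanium's real resource model, latencies of at least one cycle, and a critical-path priority. The function name and the instruction names in the example are hypothetical.

```python
from typing import Dict, List, Tuple


def list_schedule(latency: Dict[str, int],
                  deps: List[Tuple[str, str]],
                  width: int) -> Dict[str, int]:
    """Return the issue cycle chosen for each instruction."""
    preds: Dict[str, List[str]] = {n: [] for n in latency}
    succs: Dict[str, List[str]] = {n: [] for n in latency}
    for producer, consumer in deps:
        preds[consumer].append(producer)
        succs[producer].append(consumer)

    # Priority: length of the longest latency path from the instruction
    # to the end of the scheduling unit.
    height: Dict[str, int] = {}

    def path_height(n: str) -> int:
        if n not in height:
            height[n] = latency[n] + max((path_height(s) for s in succs[n]), default=0)
        return height[n]

    issue: Dict[str, int] = {}
    cycle = 0
    while len(issue) < len(latency):
        # An instruction is ready once all producers have issued and
        # their results are available in the current cycle.
        ready = [n for n in latency
                 if n not in issue
                 and all(p in issue and issue[p] + latency[p] <= cycle
                         for p in preds[n])]
        ready.sort(key=lambda n: (-path_height(n), n))
        for n in ready[:width]:
            issue[n] = cycle
        cycle += 1
    return issue


schedule = list_schedule(
    latency={"cmp": 1, "add": 1, "ld": 2, "use": 1},
    deps=[("add", "ld"), ("ld", "use")],
    width=2,
)
```

In this example `add` and `cmp` issue together in cycle 0, `ld` follows in cycle 1, and `use` waits for the two-cycle load until cycle 3.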

10 Type-based memory disambiguation
• Use JVM metadata to disambiguate memory locations
  • Type
    • Integer, floating-point, object reference …
  • Kind
    • Object field, array element, virtual table address …
  • Field id
    • putfield #10 vs. putfield #15
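A may-alias test built from these three attributes might look like the sketch below. The `MemRef` record and its string encodings are assumptions made for this example; the slide only names the attributes (type, kind, field id).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MemRef:
    type: str                       # e.g. "int", "float", "ref"
    kind: str                       # e.g. "field", "array", "vtable"
    field_id: Optional[int] = None  # constant-pool index for object fields


def may_alias(a: MemRef, b: MemRef) -> bool:
    """Conservatively report whether two references may touch the same location."""
    if a.type != b.type:
        return False                # differently typed locations never overlap
    if a.kind != b.kind:
        return False                # a field is never an array element, etc.
    if a.kind == "field" and a.field_id != b.field_id:
        return False                # putfield #10 vs. putfield #15
    return True


store_10 = MemRef("int", "field", 10)
store_15 = MemRef("int", "field", 15)
independent = not may_alias(store_10, store_15)
```

When `may_alias` returns False, the scheduler can omit the memory dependence edge between the two instructions and reorder them freely.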

11 Type-Based Disambiguation

12 Exception Dependencies
• Java exceptions are precise
• Naive approach
  • Exception checks end basic blocks
• Our approach
  • Instruction depends on exception check iff
    • Its destination is live at the exception handler, or
    • It is an exception check for a different exception type, or
    • It is a memory reference that may be guarded by the check
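The three-way rule above translates directly into a predicate. In this sketch the `Instr` record, its field names, and the way guarded memory references are recorded (a set of check names) are all hypothetical; the instructions in the example are modeled on a null check followed by an address add, a field load, and a static load.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Set


@dataclass(frozen=True)
class Instr:
    name: str
    dest: Optional[str] = None          # register written, if any
    check_type: Optional[str] = None    # exception class, if this is a check
    # Names of checks that may guard this memory reference (assumed encoding).
    may_be_guarded_by: FrozenSet[str] = frozenset()


def depends_on_check(instr: Instr, check: Instr, live_at_handler: Set[str]) -> bool:
    """Apply the slide's rule: keep the dependence only if one condition holds."""
    if instr.dest is not None and instr.dest in live_at_handler:
        return True
    if instr.check_type is not None and instr.check_type != check.check_type:
        return True
    return check.name in instr.may_be_guarded_by


null_check = Instr("null_check", check_type="NullPointerException")
add_addr = Instr("add", dest="r17")
load_field = Instr("ld_field", dest="r18",
                   may_be_guarded_by=frozenset({"null_check"}))
load_static = Instr("ld_static", dest="f8")
bounds_check = Instr("bounds_check", check_type="ArrayIndexOutOfBoundsException")
```

With nothing live at the handler, only the field load and the bounds check stay ordered after the null check; the address add and the static load are free to move above it.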

13 Exception Dependency Example
    1: (p6, p0) = cmp.eq r16, 0
    2: (p6) br ThrowNullPointerException
    3: r17 = add r16, 8
    4: r18 = ld [r17]       // load field
    5: r21 = movl 0x000F14E
    6: f8 = fld [r21]       // load static

14 Exception Dependencies

15 IPF Architecture
• Execution (functional) unit type – M, I, F, B
• Instruction (syllable) type – M, A, I, F, B, IL
• Bundles, templates
    .mii  .mi;;i  .mil  .mmi  .m;;mi  .mfi  .mmf  .mib  .mbb  .bbb  .mmb  .mfb
• Instruction group – no RAW, WAW with some exceptions

    .mi;;i
    r10 = ld [r15]
    r9 = add r8, 1 ;;     // stop bit
    r16 = shr r9, r32

16 Template Selection
• Pack instructions into bundles
  • Choose slot for each instruction
  • Insert NOP instructions
  • Assign instructions to functional units
Problem:
• Resource oversubscription
• Inaccurate bypass latencies

17 Algorithm
• Greedy slot assignment
• Sort instructions by syllable type
  • M < F < IL < I < A < B

    I1: r20 = sxt r14 (I-type)
    I2: r21 = movl ADDR (IL-type)
    I3: f15 = fadd f10, f11 (F-type)

    Unsorted: NOP I1 NOP I2 NOP I3 NOP
    Sorted:   NOP I3 I1 NOP I2
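One possible reading of the greedy slot assignment is sketched below. It is an assumption-laden model, not the authors' algorithm: all instructions are taken to belong to one instruction group with no dependences, stop bits are ignored, each instruction goes first-fit into the lowest slot of the earliest bundle that still matches some template, and the long-immediate template (written .mil on the slide) is modeled as `mlx` with the IL instruction occupying slots 1 and 2. All function and table names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Slot letters of each bundle template (stop-bit variants omitted).
TEMPLATES = ["mii", "mlx", "mmi", "mfi", "mmf", "mib", "mbb", "bbb", "mmb", "mfb"]
# Slot letters that can host each syllable type; A-type fits M or I slots.
SLOT_LETTERS = {"M": "m", "I": "i", "F": "f", "B": "b", "A": "mi"}
# Sort order from the slide: M < F < IL < I < A < B.
ORDER = {"M": 0, "F": 1, "IL": 2, "I": 3, "A": 4, "B": 5}

Bundle = List[Optional[Tuple[str, str]]]    # per slot: (instr name, slot letter)


def placements(syllable: str) -> List[Dict[int, str]]:
    """Enumerate slot/letter choices, lowest slot first."""
    if syllable == "IL":
        return [{1: "l", 2: "x"}]           # long immediate takes two slots
    return [{slot: letter} for slot in range(3) for letter in SLOT_LETTERS[syllable]]


def try_place(bundle: Bundle, name: str, syllable: str) -> bool:
    for choice in placements(syllable):
        if any(bundle[s] is not None for s in choice):
            continue
        letters = [entry[1] if entry else None for entry in bundle]
        for slot, letter in choice.items():
            letters[slot] = letter
        # Accept only if some template agrees with every filled slot.
        if any(all(w is None or w == t[i] for i, w in enumerate(letters))
               for t in TEMPLATES):
            for slot, letter in choice.items():
                bundle[slot] = (name, letter)
            return True
    return False


def assign_slots(instrs: List[Tuple[str, str]], sort: bool = True) -> List[List[str]]:
    if sort:
        instrs = sorted(instrs, key=lambda x: ORDER[x[1]])
    bundles: List[Bundle] = []
    for name, syllable in instrs:
        if not any(try_place(b, name, syllable) for b in bundles):
            fresh: Bundle = [None, None, None]
            try_place(fresh, name, syllable)
            bundles.append(fresh)
    return [[entry[0] if entry else "NOP" for entry in b] for b in bundles]


group = [("I1", "I"), ("I2", "IL"), ("I3", "F")]
unsorted_bundles = assign_slots(group, sort=False)
sorted_bundles = assign_slots(group, sort=True)
```

Under this model the slide's three instructions need three bundles in their original order but only two once sorted, because placing the F-type instruction first leaves an I slot free in the same bundle. The sorted layout appears consistent with the slide's "NOP I3 I1, NOP I2" result; `I2` is listed twice in its bundle here only because it fills two slots.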

18 Template Selection Heuristics

19 Bypass Latency Accuracy
[Figure: r17 = add r16, 8 issued on an M-Unit or on an I-Unit, feeding r18 = ld [r17] on an M-Unit; the two dependence edges are labeled with latencies 1 and 2 respectively.]
Phase ordering of functional unit assignment
• Code selection time is too early: underutilizes resources
• Template selection time is too late: inaccurate scheduling latencies
Solution: assign to functional unit during scheduling
• Assign to M-Unit if available, else
• Assign to I-Unit and increment latency
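The rule in the solution can be sketched as a helper that a scheduler would call when it issues an A-type instruction whose result feeds an address. The function name, the dictionary of free unit counts, and the one-cycle penalty constant are assumptions made for this example.

```python
from typing import Dict, Optional, Tuple


def assign_unit(free_units: Dict[str, int], base_latency: int) -> Tuple[Optional[str], int]:
    """Pick a functional unit for an A-type instruction in the current cycle.

    Returns the chosen unit (or None if neither is free) and the latency
    the scheduler should use on edges to dependent loads.
    """
    if free_units.get("M", 0) > 0:
        free_units["M"] -= 1
        return "M", base_latency
    if free_units.get("I", 0) > 0:
        free_units["I"] -= 1
        # An I-Unit result reaches a dependent load one cycle later.
        return "I", base_latency + 1
    return None, base_latency


units_this_cycle = {"M": 1, "I": 2}
first = assign_unit(units_this_cycle, base_latency=1)
second = assign_unit(units_this_cycle, base_latency=1)
```

Deciding the unit at scheduling time means the latency used for the dependent load already reflects the bypass penalty, which is the accuracy problem the slide describes.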

20 Modeling of Address Computation Latency

21 Other optimizations
• Predication
  • Profitability depends on the benchmark
  • Performance variations within 2%
• Branch hints
  • Up to 50% speedup from using branch hints
• Sign-extension elimination
  • 1% potential gain for our compiler

22 Conclusions
• Light-weight optimization techniques for Itanium
• Considering microarchitecture is important
  • Cannot ignore bypass latencies
  • Template selection should be resource sensitive
• Language semantics helps to improve ILP
  • Type-based memory disambiguation
  • Exception dependency elimination