
© 2011 IBM Corporation
Reducing Trace Selection Footprint for Large-scale Java Applications without Performance Loss
Peng Wu, Hiroshige Hayashizaki, Hiroshi Inoue, and Toshio Nakatani
IBM Research
Peng Wu, October 20, 2011

Slide 2: Trace-based Compilation in a Nutshell
Trace-based compilation stems from a simple idea: build compilation scopes dynamically out of execution paths.
[Figure: a method f whose entry leads to a rarely executed branch (if (x != 0)) and a frequently executed loop (while (!end) do something); the selected trace follows the hot path and leaves via trace exits at the cold branch and the return.]
Trace selection: how to form good compilation scopes.
Optimization: the scope-mismatch problem.
Code-gen: how to handle trace exits.
Common traps in understanding trace selection:
– Do not think about path profiling; think about trace recording.
– Do not think about program structures; think about graphs, paths, splits, and joins.
– Do not think about global decisions; think about local decisions.
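To make the recording-centric view concrete, here is a minimal sketch of one-pass trace selection in the spirit described above. All names, thresholds, and conditions are hypothetical illustrations, not the Testarossa trace-JIT's actual API: candidate trace heads are counted, and once a head is hot, the blocks that actually execute are recorded until a termination condition fires.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Minimal, hypothetical sketch of one-pass trace selection (not the paper's code).
public class TraceRecorder {
    static final int HOT_THRESHOLD = 500;    // hypothetical hotness threshold
    static final int MAX_TRACE_LENGTH = 128; // hypothetical recording-buffer limit

    final Map<Integer, Integer> headCounters = new HashMap<>(); // block id -> exec count
    final Set<Integer> traceHeads = new HashSet<>();            // heads that already own a trace
    List<Integer> recording = null;                             // null when not recording

    // Called by the interpreter every time it enters a basic block.
    void onBlockEntry(int blockId, boolean isBackEdgeTarget) {
        if (recording != null) {
            if (recording.contains(blockId) || recording.size() >= MAX_TRACE_LENGTH) {
                submitTrace(recording);      // cycle detected or buffer full: stop here
                recording = null;
            } else {
                recording.add(blockId);      // keep following the path actually executed
            }
            return;
        }
        // Only back-edge targets are head candidates here (trace exits are omitted).
        if (!isBackEdgeTarget || traceHeads.contains(blockId)) return;
        int count = headCounters.merge(blockId, 1, Integer::sum);
        if (count >= HOT_THRESHOLD) {
            traceHeads.add(blockId);
            recording = new ArrayList<>();
            recording.add(blockId);          // start recording a new trace at this head
        }
    }

    void submitTrace(List<Integer> blocks) {
        System.out.println("compile trace: " + blocks);
    }

    public static void main(String[] args) {
        TraceRecorder r = new TraceRecorder();
        // Simulate a hot loop: header block 0 (a back-edge target) and body blocks 1, 2.
        for (int i = 0; i < 600; i++) {
            r.onBlockEntry(0, true);
            r.onBlockEntry(1, false);
            r.onBlockEntry(2, false);
        }
    }
}
```

Every decision in this sketch is local: a block is appended simply because it executed next. That locality is exactly why short-lived traces and excessive duplication can arise, as the following slides discuss.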

Slide 3: Trace Compilation in a Decade
[Figure: a decade of trace-based systems arranged by selection style (one-pass selection of linear/cyclic traces vs. multi-pass selection of trace trees) and by scope (loops, coarse-grained loops, all regions): Dynamo (binary), HotpathVM (Java), YETI (Java), TraceMonkey (JavaScript), SPUR (JavaScript), LuaJIT (Lua), PyPy (Python), Hotspot Trace-JIT (Java), and Testarossa Trace-JIT (Java). Annotated selection footprints range from fewer than 10 trace trees (Java Grande) and fewer than 200 traces (SPEC/SpecJVM), through a few hundred traces or under 100 trace trees, up to 1600 trace trees (DaCapo) and 1300~27000 traces (DaCapo 9.12, WebSphere): selection footprint grows as trace JITs move from loops toward all regions. Insets illustrate linear traces, cyclic traces, and trace trees with their exit stubs.]

Slide 4: An Example of the Trace Duplication Problem
[Figure: traces A, B, C, and D selected over the same small control-flow graph.]
In total, 4 traces (17 BBs) are selected for a simple loop of 4 BBs plus 1 BB.
The average BB duplication factor on DaCapo is 13.

Slide 5: Understanding the Causes (I): Short-Lived Traces
Symptom: on average, 40% of the traces selected for DaCapo 9.12 are short-lived. [Chart: the percentage of traces selected by the baseline algorithm that execute fewer than 500 times.]
Root cause:
1. Trace A is formed before trace B, but node B dominates node A.
2. Node A is part of trace B.
[Diagram: trace A is formed first; trace B is formed later; afterwards, A is no longer entered.]

Slide 6: Understanding the Causes (II): Excessive Duplication
Block duplication is inherent to any trace selection algorithm:
– e.g., most blocks following any join node are duplicated on traces.
All trace selection algorithms have mechanisms to detect repetition:
– so that cyclic paths are not unrolled (excessively).
But there are still many unnecessary duplications that do not help performance.
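As a source-level illustration of why join blocks get duplicated (our own example, not one from the talk): every trace records one concrete path through a split, so the code after the join is copied onto each trace that reaches it.

```java
// Hypothetical example of tail duplication (not from the talk).
// A trace through the then-branch and a trace through the else-branch each
// record their own copy of everything after the join, so the join block
// (the println and the return) appears on both traces.
public class TailDuplication {
    static int classify(int x) {
        int kind;
        if (x < 0) {            // split node
            kind = -1;          // path recorded by trace 1
        } else {
            kind = 1;           // path recorded by trace 2
        }
        System.out.println("kind = " + kind);  // join node: duplicated on both traces
        return kind;                           // ... and so is everything after it
    }

    public static void main(String[] args) {
        classify(-5);
        classify(7);
    }
}
```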

Slide 7: Examples of the Excessive Duplication Problem
Example 1: [figure: a cyclic trace that duplicates blocks around a very biased join node; key: this is a very biased join node]
– Q: should a cyclic trace be broken up at the inner join point?
– Hint: is it efficient to peel the 1st iteration of a loop?
Example 2: [figure: a trace recording buffer of length n filled by a large loop body]
– Q: should a trace be truncated at the buffer length (n)?
– Hint: what is the convergence of tracing a large loop body of size m (m > n)?

Slide 8: Our Solution
Root cause (recap): 1. traces A and B are selected out of sync with respect to topological order; 2. node A is part of trace B.
Reduce short-lived traces:
1. Constructing precise BBs
– addresses a common pathological duplication in trace termination conditions
2. Changing how trace head selection is done (most effective)
– addresses out-of-order trace head selection
3. Clearing counters along the recorded trace
– favors the 1st-born trace
4. Trace path profiling
– limits the negative effect of trace duplication
Reduce excessive trace duplication (see the sketch after this list):
1. Structure-based truncation
– truncate at a biased join node (e.g., the target of a back edge), etc.
2. Profile-based truncation
– truncate the tail of traces with low utilization, based on trace profiling
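The sketch below illustrates two of these ideas, structure-based truncation and counter clearing, in the style of the hypothetical recorder shown earlier; the exact conditions and names are our simplification, not the paper's implementation.

```java
import java.util.List;
import java.util.Map;

// Simplified, hypothetical sketch of two footprint techniques (not the paper's code).
public class FootprintHeuristics {

    // Structure-based truncation: while recording, stop at a biased join node,
    // approximated here as a back-edge target other than the trace head, so a
    // whole loop body is not duplicated onto a trace that merely passes by it.
    static boolean shouldTruncateAt(int blockId, int traceHeadId, boolean isBackEdgeTarget) {
        return isBackEdgeTarget && blockId != traceHeadId;
    }

    // Counter clearing: once a trace has been recorded, reset the head counters
    // of the blocks it covers. Blocks swallowed by the first-born trace start
    // cold again, so they are less likely to spawn late, short-lived duplicates.
    static void clearCountersAlongTrace(List<Integer> recordedBlocks,
                                        Map<Integer, Integer> headCounters) {
        for (int blockId : recordedBlocks) {
            headCounters.remove(blockId);
        }
    }
}
```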

Slide 9: Technique Example (I): Trace Path Profiling
Original trace selection algorithm:
1. Select promising basic blocks and monitor their execution counts.
2. Once a trace head is selected, start recording a trace.
3. Once a trace is recorded, submit it to compilation.
With trace path profiling, step 3 is replaced:
3.a. Keep interpreting the (nursery) trace
– monitor counts of trace entries and exits
– do not update the (yellow) block counters on the trace
3.b. When the trace entry count exceeds a threshold, graduate the trace from the nursery and compile it.
NOTE: traces that never graduate from the nursery are short-lived by definition!
The nursery thereby selects the topologically early trace (i.e., favors the strongest); a minimal sketch follows this slide.
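A minimal sketch of the nursery idea, with hypothetical names and thresholds; the real trace-JIT interprets nursery traces and also tracks exit counts, which are omitted here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of trace path profiling with a nursery (not the paper's code).
// A freshly recorded trace is not compiled immediately: it stays in the nursery
// and is interpreted while its entry count is monitored. Only traces whose entry
// count passes the graduation threshold are compiled; traces that never graduate
// are, by definition, short-lived and cost no compiled footprint.
public class Nursery {
    static final int GRADUATION_THRESHOLD = 500;   // hypothetical

    static class NurseryTrace {
        final int headBlockId;
        int entryCount;
        NurseryTrace(int headBlockId) { this.headBlockId = headBlockId; }
    }

    final Map<Integer, NurseryTrace> byHead = new HashMap<>();

    // A freshly recorded trace enters the nursery instead of being compiled.
    void admit(int headBlockId) {
        byHead.put(headBlockId, new NurseryTrace(headBlockId));
    }

    // Called when the interpreter enters a nursery trace at its head.
    // Returns true when the trace graduates and should be handed to the compiler.
    boolean onEntry(int headBlockId) {
        NurseryTrace t = byHead.get(headBlockId);
        if (t == null) return false;
        if (++t.entryCount >= GRADUATION_THRESHOLD) {
            byHead.remove(headBlockId);   // graduate from the nursery
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Nursery nursery = new Nursery();
        nursery.admit(42);                // a recorded trace headed at block 42
        for (int i = 0; i < 600; i++) {
            if (nursery.onEntry(42)) {
                System.out.println("trace at block 42 graduates: compile it");
            }
        }
    }
}
```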

Slide 10: Evaluation Setup
Benchmarks:
– DaCapo benchmark suite 9.12
– DayTrader 2.0 running on WebSphere 7 (3-tier setup; DB2 and the client on a separate machine)
Our trace-JIT:
– Extended IBM J9 JIT/VM to support trace compilation, based on the JDK for Java 6 (32-bit); supports a subset of the warm-level optimizations of the original J9 JIT
– 512 MB Java heap with large pages enabled, generational GC
– Steady-state performance of the baseline: DaCapo 4% slower than the J9 JIT at full opt level; DayTrader 20% slower than the J9 JIT at full opt level
Hardware: IBM BladeCenter JS22
– 4 cores (8 SMT threads) of POWER6 4.0 GHz
– 16 GB system memory

Slide 11: Trace Selection Footprint after Applying Individual Techniques (normalized to the baseline trace-JIT without any of the optimizations)
Trace selection footprint: the sum of bytecode sizes over all traces selected (lower is better).
[Chart: selection footprint per technique, normalized to the baseline.]
Observation: each individual technique reduces selection footprint by 10%~40%.

Slide 12: Cumulative Effect of Individual Techniques on Trace Selection Footprint (normalized to baseline; lower is better)
[Chart: selection footprint as the techniques are applied cumulatively.]
Observations: 1) each technique further reduces selection footprint on top of the previous ones; 2) cumulatively they reduce selection footprint to 30% of the baseline.
Steady-state time: unchanged, ranging from a 4% slowdown (luindex) to a 10% speedup (WebSphere).
Start-up time: 57% of baseline. Compilation time: 31% of baseline. Binary size: 31% of baseline.

Slide 13: Breakdown of the Sources of Selection Footprint Reduction
Most of the footprint reduction comes from eliminating short-lived traces.
The remaining reduction may come from better convergence of trace selection.

Slide 14: Comparison with Other Size-Control Heuristics
We are the first to explicitly study selection footprint as a problem. However, size-control heuristics have been used in other selection algorithms:
– Stop-at-loop-header (3% slower, 150% larger than ours)
– Stop-at-return-from-method-of-trace-head (6% slower, 60% larger than ours)
– Stop-at-existing-head (30% slower, 20% smaller than ours)
Why is stop-at-existing-head so footprint efficient?
– It does not form short-lived traces, because a trace head can never appear inside another trace.
– It subsumes stop-at-loop-header, because most loop headers become trace heads.
(The three heuristics are sketched as termination predicates below.)
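To make the comparison concrete, the three heuristics can be read as alternative trace-termination predicates. This is our paraphrase of the names on the slide, not the original implementations, and the helper signatures are hypothetical.

```java
import java.util.Set;

// The compared size-control heuristics, read as termination predicates
// (our paraphrase; hypothetical signatures, not the original implementations).
public class SizeControlHeuristics {

    // Stop-at-loop-header: stop recording when a loop header is reached.
    static boolean stopAtLoopHeader(boolean isLoopHeader) {
        return isLoopHeader;
    }

    // Stop-at-return-from-method-of-trace-head: stop when recording returns
    // out of the method that contains the trace head (frame depth relative to
    // the head's frame drops below zero).
    static boolean stopAtReturnFromMethodOfTraceHead(int frameDepthRelativeToHead) {
        return frameDepthRelativeToHead < 0;
    }

    // Stop-at-existing-head: stop when recording reaches a block that already
    // heads a trace; a head can then never appear inside another trace, which
    // is why this heuristic forms no short-lived traces.
    static boolean stopAtExistingHead(int blockId, Set<Integer> existingHeads) {
        return existingHeads.contains(blockId);
    }
}
```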

Slide 15: Summary
Common beliefs vs. our grain of salt:
1. Belief: selection footprint is a non-issue, as trace JITs target hot code only.
– The scope of trace JITs has evolved rapidly, including running large-scale applications.
2. Belief: trace selection is footprint efficient, since only live code is selected.
– Duplication can lead to serious selection footprint explosion.
3. Belief: tail duplication is the major source of trace duplication.
– There are other sources of unnecessary duplication: short-lived traces and poor selection convergence.
4. Belief: shortening individual traces is the main weapon for footprint efficiency.
– Many trace-shortening heuristics hurt performance; we propose other means to curb footprint at no cost in performance.

Slide 16: Concluding Remarks
Significant advances have been made in building real trace systems, but much less is understood about them.
Trace selection algorithms are easy to implement but hard to reason about; this work offers insights on how to identify common pitfalls of a class of trace selection algorithms, and solutions to remedy them.
Trace compilation is a drastically different approach from traditional compilation; how trace compilation compares to method compilation remains an overarching open question.

Slide 17: BACKUP

Slide 18: WAS/DayTrader Performance
[Charts: peak performance and start-up time (higher is better), JITted code size, and compilation time (shorter is better), comparing the trace-JIT against the baseline method-JIT (version pap3260_26sr_01 (SR1)) on a BladeCenter JS22, POWER6 4.0 GHz, 4 cores (8 threads), AIX 6.1.]
The trace-JIT is about 10% slower than the method-JIT in peak throughput.
The trace-JIT generates smaller code with much shorter compilation time.

Slide 19: Comparing Against Simpler Solutions

Slide 20: Our Related Work