© 2011 IBM Corporation. Reducing Trace Selection Footprint for Large-scale Java Applications without Performance Loss. Peng Wu, Hiroshige Hayashizaki, Hiroshi Inoue, and Toshio Nakatani, IBM Research.


Slide 1: Reducing Trace Selection Footprint for Large-scale Java Applications without Performance Loss. Peng Wu, Hiroshige Hayashizaki, Hiroshi Inoue, and Toshio Nakatani, IBM Research.

Slide 2: Trace-based Compilation in a Nutshell

- Stems from a simple idea: build compilation scopes dynamically out of execution paths.
- [Figure: method f; the trace follows the method entry through the frequently executed "while (!end) do something" loop, while the rarely executed "if (x != 0)" branch and the return become trace exits.]
- Trace selection: how to form a good compilation scope.
- Optimization: cope with the scope-mismatch problem.
- Code-gen: how to handle trace exits.

Common traps in understanding trace selection:
- Do not think about path profiling; think about trace recording.
- Do not think about program structures; think about graphs, paths, splits, and joins.
- Do not think about global decisions; think about local decisions.
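The recording idea above can be sketched in a few lines of illustrative Python (the threshold, buffer length, and block names are hypothetical; this is a minimal one-pass selector, not the Testarossa implementation): a candidate head accumulates an execution counter, and once it turns hot the interpreter records the blocks it actually executes until the path cycles back to the head or the trace buffer fills.

```python
HOT = 5          # hypothetical hotness threshold for a trace head
MAX_LEN = 8      # hypothetical trace-buffer length

def select_traces(execution, trace_heads):
    """execution: basic-block ids in executed order; returns recorded traces."""
    counters = {}
    traces = []
    recording = None
    for bb in execution:
        if recording is not None:
            if bb == recording[0] or len(recording) == MAX_LEN:
                # terminate: the path cycled back to its head (cyclic trace)
                # or the buffer is full (linear trace)
                traces.append(tuple(recording))
                recording = None
            else:
                recording.append(bb)
        if recording is None and bb in trace_heads:
            if not any(t[0] == bb for t in traces):   # head not compiled yet
                counters[bb] = counters.get(bb, 0) + 1
                if counters[bb] >= HOT:
                    recording = [bb]   # hot: start recording from this head

    return traces

# A hot 3-block loop A -> B -> C -> A yields one cyclic trace (A, B, C).
print(select_traces(list("ABC") * 10, {"A"}))   # [('A', 'B', 'C')]
```

Everything downstream in the talk is about what this kind of local, path-following decision does to the total size of the selected traces.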

Slide 3: Trace Compilation in a Decade

[Timeline figure: trace JITs over a decade, arranged by selection style and increasing selection footprint. One-pass trace selection (linear/cyclic traces): Dynamo (binary), YETI (Java), Dalvik (Java), Hotspot Trace-JIT (Java), Testarossa Trace-JIT (Java). Multi-pass trace selection (trace trees): HotpathVM (Java), TraceMonkey (JavaScript), SPUR (JavaScript), PyPy (Python), LuaJIT (Lua). Targeted scopes range from loops through coarse-grained regions to all regions. Footprint annotations range from <10 trees (HotpathVM on Java Grande) and <200 traces (Dynamo on SPEC) up to 1300~27000 traces (Testarossa Trace-JIT on DaCapo 9.12 and WebSphere). Side diagrams show a linear trace, a cyclic trace, and a trace tree, each with exit stubs.]

Slide 4: An Example of the Trace Duplication Problem

[Figure: traces A, B, C, and D selected from the same small loop.] In total, 4 traces (17 BBs) are selected for a simple loop of 4 BBs plus 1 BB. The average BB duplication factor on DaCapo is 13.
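One plausible way to compute the duplication factor cited above (an assumed definition for illustration: total block occurrences on selected traces divided by the number of distinct blocks covered; the talk may define it differently):

```python
def duplication_factor(traces):
    """traces: iterables of basic-block ids selected onto each trace."""
    occurrences = sum(len(t) for t in traces)            # BB copies on traces
    distinct = len({bb for t in traces for bb in t})     # unique BBs covered
    return occurrences / distinct

# Two traces through a diamond CFG duplicate both the head A and the
# join block D, so 6 block copies cover only 4 distinct blocks.
print(duplication_factor([("A", "B", "D"), ("A", "C", "D")]))   # 1.5
```

By this measure, a factor of 13 means each basic block is copied onto selected traces 13 times on average.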

Slide 5: Understanding the Causes (I): Short-Lived Traces

Symptom: [chart: percentage of traces selected by the baseline algorithm with <500 execution frequency, per benchmark.] On average, 40% of the traces on DaCapo 9.12 are short-lived: trace A is formed first, trace B is formed later, and afterwards trace A is no longer entered.

Root cause: (1) trace A is formed before trace B, even though node B dominates node A; (2) node A is part of trace B.
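The root cause can be mimicked with a tiny counter simulation (illustrative Python; block names, paths, and the threshold are hypothetical): node B dominates node A, but A executes twice per iteration and crosses the hotness threshold first, so trace A is born before trace B. Once the trace headed by B (which embeds A) exists, execution always enters at B and trace A goes dead.

```python
HOT = 3                                       # hypothetical hotness threshold
counters = {"A": 0, "B": 0}
traces = []                                   # recorded traces, in birth order

for _ in range(10):                           # interpreted loop iterations
    for bb in ["B", "A", "A"]:                # B dominates A; A runs twice per iter
        if any(bb in t for t in traces):      # a trace already covers bb:
            continue                          # execution enters that trace instead
        counters[bb] += 1
        if counters[bb] >= HOT:
            # record the path observed from bb (paths here are illustrative)
            traces.append(("B", "A", "A") if bb == "B" else ("A", "A"))

# Trace ("A", "A") is born first; once ("B", "A", "A") exists, the entry
# of trace ("A", "A") is never reached again: it is short-lived.
print(traces)
```

The fix the talk builds toward is to make trace-head selection respect this topological order instead of raw counter order.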

Slide 6: Understanding the Causes (II): Excessive Duplication

- Block duplication is inherent to any trace selection algorithm; e.g., most blocks following a join node are duplicated across traces.
- All trace selection algorithms have mechanisms to detect repetition, so that cyclic paths are not unrolled excessively.
- But many unnecessary duplications remain that do not help performance.

Slide 7: Examples of the Excessive Duplication Problem

Example 1: [CFG with a heavily biased join node.] Q: should a cyclic trace be broken up at an inner join point? Key: this is a very biased join node. Hint: is it efficient to peel the first iteration of a loop?

Example 2: [trace buffer of length n.] Q: should a trace be truncated at the buffer length (n)? Hint: what is the convergence of tracing a large loop body of size m (m > n)?

Slide 8: Our Solution

Root cause (recap): (1) traces A and B are selected out of sync with respect to topological order; (2) node A is part of trace B.

Reduce short-lived traces:
1. Construct precise BBs: addresses a common pathological duplication in trace termination conditions.
2. Change how trace-head selection is done (most effective): addresses out-of-order trace-head selection.
3. Clear counters along a recorded trace: favors the first-born trace.
4. Trace path profiling: limits the negative effect of trace duplication.

Reduce excessive trace duplication:
1. Structure-based truncation: truncate at a biased join node (e.g., the target of a back-edge).
2. Profile-based truncation: truncate the tail of traces with low utilization, based on trace profiling.
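Structure-based truncation can be sketched as follows (an assumed policy for illustration, not the exact rule used in the trace-JIT): once recording runs into the target of a known back-edge, i.e. a heavily biased join node, the trace is cut there rather than dragging a peeled copy of the loop body along.

```python
def truncate_at_loop_header(recorded_path, loop_headers):
    """Cut the recorded path just before the first loop header
    encountered after the trace head (position 0 is the head itself)."""
    for i, bb in enumerate(recorded_path):
        if i > 0 and bb in loop_headers:
            return recorded_path[:i]
    return recorded_path

# A trace entered from straight-line code runs into loop header L:
# keep only the prefix instead of duplicating the loop body B1, B2.
print(truncate_at_loop_header(["E1", "E2", "L", "B1", "B2"], {"L"}))
# -> ['E1', 'E2']
```

A trace that starts at a loop header itself is left intact, since the header at position 0 is the trace head, not a join reached mid-trace.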

Slide 9: Technique Example (I): Trace Path Profiling

Original trace selection algorithm:
1. Select promising BBs and monitor their execution counts.
2. Once a trace head is selected, start recording a trace.
3. Once a trace is recorded, submit it to compilation.

With trace path profiling, step 3 changes:
3.a. Keep interpreting the (nursery) trace: monitor the counts of trace entries and exits, and stop updating the block counters along the trace.
3.b. When the trace entry count exceeds a threshold, graduate the trace from the nursery and compile it.

The nursery selects the topologically early trace (i.e., favors the "strongest"). Note: traces that never graduate from the nursery are short-lived by definition.
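The nursery idea can be sketched like this (illustrative Python; the threshold value and class names are hypothetical, not the trace-JIT's actual data structures): a freshly recorded trace stays interpreted while its entry count is monitored, and only graduates to compilation if entries keep arriving. A trace that stops being entered never accumulates enough entries and is never compiled.

```python
GRADUATION_THRESHOLD = 100   # hypothetical entry count needed to compile

class NurseryTrace:
    def __init__(self, path):
        self.path = path         # recorded basic-block sequence
        self.entries = 0         # times execution entered this trace
        self.compiled = False

    def enter(self):
        if self.compiled:
            return "run compiled code"
        self.entries += 1        # keep interpreting, monitor the entry count
        if self.entries >= GRADUATION_THRESHOLD:
            self.compiled = True # graduate from the nursery: submit to JIT
        return "interpret trace"

t = NurseryTrace(("A", "B", "C"))
for _ in range(GRADUATION_THRESHOLD):
    t.enter()
print(t.compiled)                # True: the trace graduated and was compiled
```

A short-lived trace, by contrast, would receive only a handful of entries before its entry point stops being reached, so it would stay in the nursery and cost no compilation time or code space.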

Slide 10: Evaluation Setup

Benchmarks:
- DaCapo benchmark suite 9.12.
- DayTrader 2.0 running on WebSphere 7 (3-tier setup; DB2 and the client on a separate machine).

Our trace-JIT:
- Extended the IBM J9 JIT/VM to support trace compilation, based on the JDK for Java 6 (32-bit); supports a subset of the warm-level optimizations in the original J9 JIT; 512 MB Java heap with large pages enabled, generational GC.
- Steady-state performance of the baseline: DaCapo, 4% slower than the J9 JIT at full opt level; DayTrader, 20% slower than the J9 JIT at full opt level.

Hardware: IBM BladeCenter JS22, 4 cores (8 SMT threads) of POWER6 4.0 GHz, 16 GB system memory.

Slide 11: Trace Selection Footprint after Applying Individual Techniques (normalized to the baseline trace-JIT without any optimizations)

Trace selection footprint: the sum of bytecode sizes over all traces selected. [Chart; lower is better.]

Observation: each individual technique reduces the selection footprint by 10%~40%.
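The metric and its normalization amount to simple arithmetic; the trace sizes below are made up for illustration only:

```python
def selection_footprint(trace_bytecode_sizes):
    """Sum of bytecode sizes (bytes) over all selected traces."""
    return sum(trace_bytecode_sizes)

baseline  = selection_footprint([120, 80, 200, 150])   # 550 bytes selected
optimized = selection_footprint([120, 80])             # 200 bytes selected
print(round(optimized / baseline, 2))                  # 0.36 (lower is better)
```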

Slide 12: Cumulative Effect of Individual Techniques on Trace Selection Footprint (Normalized to the Baseline)

[Chart; lower is better.]

Observations: (1) each technique further improves the selection footprint over the previous techniques; (2) cumulatively, they reduce the selection footprint to 30% of the baseline.

Steady-state time: unchanged, ranging from a 4% slowdown (luindex) to a 10% speedup (WebSphere). Start-up time: 57% of the baseline. Compilation time: 31% of the baseline. Binary size: 31% of the baseline.

Slide 13: Comparison with Other Size-Control Heuristics

- We are the first to explicitly study selection footprint as a problem.
- However, size-control heuristics were used in other selection algorithms:
  - Stop-at-loop-header (3% slower, 150% larger than ours)
  - Stop-at-return-from-method-of-trace-head (6% slower, 60% larger than ours)
  - Stop-at-existing-head (30% slower, 20% smaller than ours)

Why is stop-at-existing-head so footprint-efficient?
- It does not form short-lived traces, because a trace head cannot appear in another trace.
- It subsumes stop-at-loop-header, because most loop headers become trace heads.
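The stop-at-existing-head heuristic discussed above is simple to sketch (illustrative Python; block names are hypothetical): recording terminates as soon as the next block is already the head of another trace, so no head can ever be embedded inside a later trace, and the short-lived-trace pattern of slide 5 cannot arise.

```python
def record_stop_at_existing_head(path, existing_heads):
    """Record a trace from path[0], stopping before any block that is
    already the head of another trace."""
    trace = [path[0]]
    for bb in path[1:]:
        if bb in existing_heads:   # would embed another trace's head: stop
            break
        trace.append(bb)
    return trace

# Recording from A stops when it reaches L, the head of an existing trace:
print(record_stop_at_existing_head(["A", "B", "L", "C"], {"L"}))
# -> ['A', 'B']
```

The trade-off the slide reports is that this aggressive cutting yields very short traces: 20% smaller footprint than the proposed techniques, but 30% slower.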

Slide 14: Comparing Against Simpler Solutions

Slide 15: Summary: Common Beliefs vs. Our Grain of Salt

1. Belief: selection footprint is a non-issue, as trace JITs target hot code only. Our grain of salt: the scope of trace JITs has evolved rapidly, including running large-scale applications.
2. Belief: trace selection is footprint-efficient, as only live code is selected. Our grain of salt: duplication can lead to serious selection-footprint explosion.
3. Belief: tail duplication is the major source of trace duplication. Our grain of salt: there are other sources of unnecessary duplication, namely short-lived traces and poor selection convergence.
4. Belief: shortening individual traces is the main weapon for footprint efficiency. Our grain of salt: many trace-shortening heuristics hurt performance; we proposed other means to curb footprint at no cost in performance.

Slide 16: WAS/DayTrader Performance

[Charts: peak performance (higher is better), start-up time, JITted code size, and compilation time (shorter is better) of the trace-JIT versus the baseline method-JIT (version pap3260_26sr _01(SR1)). Measured on a BladeCenter JS22, POWER6 4.0 GHz, 4 cores (8 threads), AIX 6.1.]

- The trace-JIT is about 10% slower than the method-JIT in peak throughput.
- The trace-JIT generates a smaller code size with much shorter compilation time.

Slide 17: Concluding Remarks & Future Directions

- Significant advances have been made in building real trace systems, but much less is understood about them.
- This work offers insights into identifying common pitfalls of a class of trace selection algorithms, along with solutions to remedy them.
- Trace compilation vs. method compilation remains an over-arching open question.

Slide 18: Backup

Slide 19: Breakdown of the Sources of Selection Footprint Reduction

Most of the footprint reduction comes from eliminating short-lived traces; further reductions may come from better convergence of trace selection.

Slide 20: Our Related Work