Runtime Specialization With Optimistic Heap Analysis. AJ Shankar, Ras Bodik (UC Berkeley); Subbu Sastry, Jim Smith (UW Madison).

Presentation transcript:


2 Specialization (partial evaluation) [Diagram: Code + Constant Input → Specializer → Code'; Code' + Variable Input → Output] Hardcode constant values directly into the code Big speedups (100%+) possible But hard to make usable…

3 First practical specializer Automatic: no manual annotations Dynamic: no offline phase Easy to deploy: hidden in a JIT compiler Powerful: precisely finds all heap constants Fast: under 1s, low overheads

4 Specializer: what would benefit? Any program that relies heavily on data that is (largely) constant at runtime For this talk, we’ll focus on one domain But we’ve benchmarked several  Speedups of 20% to 500%

5 The local bookstore… JavaScript LISP Matlab Perl Python Ruby Scheme Visual Basic

6 Interpreters Interpreters: preferred implementation  Easy to write  Verifiable: interpreter is close to the language spec  Deployable: easily portable  Programmer-friendly: enable rapid development cycle More scripting languages to come  More interpreters to appear

7 But interpreters are slow Programmers complain about interpreter speed  20 open Mozilla bugs decrying slow JavaScript Google searches:  “python slow”: 674k  “visual basic slow”: 3.1M  “perl slow”: 810k  (“perl porn”: 236k) Compiler?  Time-consuming to write, maintain, debug  Programmers often don’t want one

8 Specialization of an interpreter  Goal: Make interpreters fast, easily and for free [Diagram: the generic specialization picture: Code, Constant Input, Variable Input, Output]

9 Specialization of an interpreter  Goal: Make interpreters fast, easily and for free [Diagram: Perl Interpreter (code) + Perl program P (constant input) → Specializer, sitting in a JIT Compiler on a JVM → P "native"; Input to P, other state (variable input) → Output] So how come no one actually does this?

10 A Brief History of Specialization Early specialization (or partial evaluation)  Operated on whole programs  Required functional languages  Hand-directed Recent results  Specialize imperative languages like C (Tempo, DyC)  … Even if only a code fragment is specializable  Reduced annotation burden (Calpa, Suganuma et al.)  Profile-based (Suganuma) But challenges remain…

11 Specialization Overview Interpret() { pc = oldpc+1; switch (instr[pc]) { … } } [Diagram: guards pc == 7 and pc == 10 on the dispatch; a specialized trace for instr[pc] == LD] 1. Where to specialize? 2. What heap values are constant? 3. When are assumed constants changed?
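The dispatch loop on this slide is the specialization target. A minimal sketch of the idea, with a toy opcode set whose names and semantics are illustrative rather than taken from the actual system: once profiling shows pc is usually a particular hot value and the instruction there is invariant, the specializer can emit a trace with the array load and the switch removed.

```java
public class Dispatch {
    static final int PUSH1 = 0, ADD = 1;

    // Generic interpreter step: load the opcode, then dispatch on it.
    static int step(int[] instr, int pc, int acc) {
        switch (instr[pc]) {
            case PUSH1: return acc + 1;   // toy semantics
            case ADD:   return acc + acc;
            default:    return acc;
        }
    }

    // Specialized trace for the hot value pc == 2, assuming instr[2] == ADD
    // stays invariant: the load of instr[pc] and the switch are both gone.
    static int stepSpecializedPc2(int acc) {
        return acc + acc;
    }
}
```

The specialized trace is only valid while the assumption instr[2] == ADD holds, which is exactly what questions 2 and 3 on the slide are about.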

12 Existing solutions What code to specialize?  Current systems use annotations  But annotations are imprecise and a barrier to acceptance What heap values can we use as constants?  Heap provides bulk of speedup (500% vs 5% without)  Annotations: imprecise, not input-specific How to invalidate optimistic assumptions?  Optimism good for better specialization  Current solutions unsound or untested

13 Our Solution: Dynamic Analysis Precise: can specialize on  This execution’s input  Partially invariant data structures Fast: online sample-based profiling has low overhead Deployable: transparent, sits in a JIT compiler  Just write your program in Java/C# Simple to implement: let VM do the drudge work  Code generation, profiling, constant propagation, recompilation, on-stack replacement

14 Algorithm 1. Find a specialization starting point e_pc: e_pc = FindSpecPoint(hot_function) 2. Specialize: create a trace t(e_pc, k) for each hot value k  Constant propagation, modified: assume e_pc = k  Eliminate loads from invariant memory locations: replace x := load loc with x := mem[loc] if Invariant(loc)  Create a trace, not a CFG: loops unrolled, branch prediction for non-constant conditionals; eliminates safety checks, dynamic dispatch, etc. too  Modify dispatch at pc to select trace t when e_pc = k 3. Invalidate  Let S be the set of assumed-invariant locations  If Updated(loc) where loc ∈ S, invalidate
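Step 2's load elimination can be sketched concretely. This is a toy residualizer, assuming a straight-line trace of two-field instructions and an Invariant(loc) oracle supplied by the profiler; all names here are illustrative, not the system's actual IR:

```java
import java.util.*;

public class TraceSpecializer {
    // Toy instruction: op is "load" (arg = heap location) or anything else.
    static final class Insn {
        final String op; final int arg;
        Insn(String op, int arg) { this.op = op; this.arg = arg; }
    }

    // Replace x := load loc with the current value mem[loc] whenever the
    // profiler claims loc is invariant; other instructions pass through.
    static List<Insn> specialize(List<Insn> trace, int[] mem, Set<Integer> invariant) {
        List<Insn> out = new ArrayList<>();
        for (Insn i : trace) {
            if (i.op.equals("load") && invariant.contains(i.arg)) {
                out.add(new Insn("const", mem[i.arg]));  // folded heap load
            } else {
                out.add(i);
            }
        }
        return out;
    }
}
```

The folded constants then feed the modified constant propagator, which is what removes the downstream dispatch and safety checks.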

15 Solution 1: FindSpecPoint Where to start a specialized trace?  The best point can be near the end of the function Ideally: try to specialize from all instructions  Pick the best one  But too slow for large functions Local heuristics inconsistent, inaccurate  Execution frequency, value hotness, CFG properties Need an efficient global algorithm  Should come up with a few good candidates

16 FindSpecPoint: Influence If e_pc = k, how many dynamic instructions can we specialize away?  Most precise: actually specialize  Upper bound: forward dynamic slice of e_pc Too costly for an online environment  Our solution: Influence, an upper bound of the dynamic slice, dataflow-independent Def: Influence(e) = expected number of dynamic instructions from the first occurrence of e_pc to the end of the function System of equations, solved in linear time

17 Influence example [Figure: a CFG with a branch; is the split 40%? 60%? Not quite…] 1. Probability of ever reaching the instruction: how often will the trace be executed? 2. Length of the dynamic trace from the instruction to the end: how much benefit is obtainable? Can approximate 1 and 2 by… 3. Expected trace length to end = Influence Influence consistently selects the best specialization points
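The "system of equations, solved in linear time" can be sketched as follows. This is a simplification assuming an acyclic CFG given in topological order with edge probabilities (the real analysis must also handle loops); reach probability and expected trace-length-to-end combine into influence:

```java
public class InfluenceCalc {
    // prob[u][v] = probability of taking edge u -> v; nodes are in
    // topological order, node 0 is the entry, the last node is the exit.
    // For each non-exit node the outgoing probabilities sum to 1.
    static double[] influence(double[][] prob) {
        int n = prob.length;

        double[] reach = new double[n];   // P(node is ever executed)
        reach[0] = 1.0;
        for (int u = 0; u < n; u++)
            for (int v = u + 1; v < n; v++)
                reach[v] += reach[u] * prob[u][v];

        double[] expLen = new double[n];  // expected #instructions to exit
        for (int u = n - 1; u >= 0; u--) {
            expLen[u] = 1.0;              // count the node itself
            for (int v = u + 1; v < n; v++)
                expLen[u] += prob[u][v] * expLen[v];
        }

        double[] inf = new double[n];     // Influence = reach * expected length
        for (int u = 0; u < n; u++) inf[u] = reach[u] * expLen[u];
        return inf;
    }
}
```

Each node's value depends only on its successors (or, for reach, predecessors), so one forward pass and one backward pass suffice: linear in the size of the CFG for sparse graphs.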

18 Solution 2: Invariant(loc) Primary issue: would like to know what memory locations are invariant  Provides the bulk of the speedup  Existing work relied on static analysis or annotations Our solution: sampled invariance profiling  Track every nth store  Locations detected as written: not constant  Everything else: optimistically assumed constant 95.6% of claimed constants remained constant
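A minimal sketch of the sampled store profiler described above, assuming a flat integer address space (the class and method names are hypothetical): every nth store is inspected, and any location ever observed written loses its optimistic-constant status.

```java
import java.util.*;

public class InvarianceProfiler {
    private final int period;     // sample every Nth store
    private int countdown;
    private final Set<Integer> written = new HashSet<>();

    InvarianceProfiler(int period) {
        this.period = period;
        this.countdown = period;
    }

    // Instrumented store: a cheap decrement on the fast path, a set
    // insert only when the sample counter expires.
    void onStore(int loc) {
        if (--countdown == 0) {
            countdown = period;
            written.add(loc);
        }
    }

    // Optimistic answer: never observed written => assumed constant.
    // Sampling can miss stores, which is why specialized traces must
    // still be guarded and invalidated (next slide).
    boolean assumedInvariant(int loc) {
        return !written.contains(loc);
    }
}
```

Note the asymmetry: a sampled store can only demote a location from constant to non-constant, so longer profiling only makes the constant set more conservative.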

19 Profiling, cont’d Use Arnold-Ryder duplication-based sampling to gather other useful info  CFG edge execution frequencies Helps identify good trace start points (influence)  Hot values at particular program points Helps seed the constant propagator with initial values

20 Solution 3: Invalidation Our heap analysis is optimistic  We need to guard assumed constant locations  And invalidate corresponding traces Our solution to the two key problems:  Detect when such a location is updated Use write barriers (type information eliminates most barriers) Overhead: ~6% << specialization benefit  Invalidate corresponding specialized traces A bit tricky: trace may need to be invalidated while executing See paper for our solution
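The guard-and-invalidate machinery above can be sketched as a map from guarded locations to dependent traces, consulted by the write barrier on stores. This is a simplification with illustrative names: the real system uses type information to elide most barriers and has extra machinery for invalidating a trace that is currently executing.

```java
import java.util.*;

public class InvalidationMap {
    // loc -> actions that invalidate the traces assuming loc is constant.
    private final Map<Integer, List<Runnable>> guards = new HashMap<>();

    void addGuard(int loc, Runnable invalidateTrace) {
        guards.computeIfAbsent(loc, k -> new ArrayList<>()).add(invalidateTrace);
    }

    // Called by the compiled write barrier when loc is stored to: every
    // trace that assumed loc constant is invalidated, and the guard is
    // dropped so later stores to loc no longer pay for it.
    void onGuardedStore(int loc) {
        List<Runnable> traces = guards.remove(loc);
        if (traces != null) traces.forEach(Runnable::run);
    }
}
```

Dropping the guard after it fires keeps the steady-state barrier cost proportional to actual invalidations, consistent with the ~6% overhead quoted on the slide.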

21 Experimental evaluation Implemented in JikesRVM Does the specializer work?  Benchmarked real-world programs, existing specialization kernels Is it suitable for a runtime environment?  Benchmarked programs unsuitable for specialization  Measured overheads Does it exploit opportunities unavailable to other specializers?  Looked at specific specializations for evidence

22 Results (benchmark: description; input: speedup)
convolve: transforms an image with a matrix; from the ImageJ toolkit
  fixed image, various matrices: 2.74x
  fixed matrix, various images: 1.23x
dotproduct: converted from C version in DyC
  sparse constant vector: 5.17x
interpreter: interprets simple bytecodes
  bubblesort bytecodes: 5.96x
  binary search bytecodes: 6.44x
jscheme: interprets Scheme code
  partial evaluator: 1.82x
query: performs a database query; from DyC
  semi-invariant query: 1.71x
sim8085: Intel 8085 microprocessor simulator
  included sample program: 1.70x
em3d: electromagnetic wave propagation (intentionally unspecializable)
  -n d x

23 Suitable for runtime environment? Fully transparent Low overheads, dwarfed by speedups  Profiling overhead range: 0.1% %  Specialization time average: 0.7s  Invalidation barrier overhead average: 4%  See paper for extensive breakdown of overheads Overhead on unspecializable programs < 6%

24 Runtime-only opportunities? Convolve specialized in two different ways  For two different inputs Query specialized on partially invariant structure Interpreter specialized on constant locations in interpreted program  23% of dynamic loads from interpreted address space were constant; an additional 9.6% of all loads in interpreter’s execution were eliminated  No distinction between address “spaces”

25 The end is the beginning (is the end) I’ve presented a new specializer that  Is totally transparent  Exposes new specialization opportunities  Is easy to throw into a JVM

26 Does the specializer work? Similar speedups to existing specializers  And on similar benchmarks  With no annotations or offline phase Ran on real-world programs  Jscheme is a real interpreter  Interpreting a 500-line partial evaluator (ha!)

27 Practical Specialization We want the following properties:  Automatically identify “constant” inputs  Automatically identify specializable code  Ensure soundness if “constants” change Some barriers to acceptance in the past  Manual program annotations to specify constants  Offline analysis  Inefficient or incomplete soundness guarantees

28 Challenge 1: What code to specialize? Requires programmer annotations (DyC, Tempo)  Input not available at annotation time  No transparency: involves the programmer A real roadblock to acceptance … or offline annotation inference (Calpa)  Input not available at inference time  Abstraction in static analysis dilutes precision  Too slow for JIT compilers … or specialize the whole method (Suganuma)

29 Challenge 2: Heap constants Which heap locations don’t change at run time? Annotations Static analysis Or greatly restrict heap usage (Suganuma)  Heap analysis is hard but very beneficial…  5% speedup with Suganuma vs. 500% using full heap

30 Challenge 3: Invalidation Can specialize better if optimistic:  Assume that some memory locations don’t change How to check invalidation of these assumptions?  Programmer inserts invalidations Possibly unsound  Pointer analysis Likely high overhead No evaluation in the literature