1
Runtime Specialization With Optimistic Heap Analysis
AJ Shankar, UC Berkeley
Ras Bodik, Subbu Sastry, Jim Smith; UC Berkeley, UW Madison
2
Specialization (partial evaluation)
[Diagram labels: Code, Constant Input, Variable Input, Specializer, Code’, Output]
Hardcode constant values directly into the code
Big speedups (100%+) possible
But hard to make useable…
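As a rough illustration of "hardcoding constant values directly into the code", the sketch below specializes a dot product on a constant sparse vector. It is not the talk's specializer: the names `dot`, `specialize_dot`, and `dot_specialized` are hypothetical, and the residual code is produced by simple source generation rather than by a JIT compiler.

```python
def dot(const_vec, var_vec):
    """Generic dot product: both vectors are read at run time."""
    total = 0
    for c, v in zip(const_vec, var_vec):
        total += c * v
    return total


def specialize_dot(const_vec):
    """Generate a residual function with const_vec hardcoded.

    Zero entries disappear entirely and multiplications by 1 are
    simplified, so the residual code touches only the useful slots.
    """
    terms = []
    for i, c in enumerate(const_vec):
        if c == 0:
            continue  # this term can never contribute
        terms.append(f"v[{i}]" if c == 1 else f"{c!r} * v[{i}]")
    body = " + ".join(terms) if terms else "0"
    source = f"def dot_specialized(v):\n    return {body}\n"
    namespace = {}
    exec(source, namespace)  # compile the residual program
    return namespace["dot_specialized"], source


sparse = [0, 3, 0, 0, 1, 0]          # the "constant input"
dot_specialized, source = specialize_dot(sparse)
print(source)
```

For the sparse vector above, the generated function reads only `v[1]` and `v[4]`; the loop, the zero entries, and the loads of the constant vector are gone.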
3
First practical specializer
Automatic: no manual annotations
Dynamic: no offline phase
Easy to deploy: hidden in a JIT compiler
Powerful: precisely finds all heap constants
Fast: under 1s, low overheads
4
Specializer: what would benefit?
Any program that relies heavily on data that is (largely) constant at runtime
For this talk, we’ll focus on one domain
But we’ve benchmarked several
Speedups of 20% to 500%
5
The local bookstore…
JavaScript, LISP, Matlab, Perl, Python, Ruby, Scheme, Visual Basic
6
Interpreters
Interpreters: preferred implementation
  Easy to write
  Verifiable: interpreter is close to the language spec
  Deployable: easily portable
  Programmer-friendly: enable rapid development cycle
More scripting languages to come
  More interpreters to appear
7
But interpreters are slow
Programmers complain about interpreter speed
  20 open Mozilla bugs decrying slow JavaScript
  Google searches:
    “python slow”: 674k
    “visual basic slow”: 3.1M
    “perl slow”: 810k (“perl porn”: 236k)
Compiler?
  Time-consuming to write, maintain, debug
  Programmers often don’t want one
8
Specialization of an interpreter
Goal: Make interpreters fast, easily and for free
[Diagram labels: Code, Constant Input, Variable Input, Output]
9
Specialization of an interpreter
Goal: Make interpreters fast, easily and for free
[Diagram labels: Perl Interpreter, Perl program P, Input to P and other state, Specializer, JIT Compiler, JVM, “native” P, Output]
So how come no one actually does this?
10
A Brief History of Specialization
Early specialization (or partial evaluation)
  Operated on whole programs
  Required functional languages
  Hand-directed
Recent results
  Specialize imperative languages like C (Tempo, DyC)
  … Even if only a code fragment is specializable
  Reduced annotation burden (Calpa, Suganuma et al.)
  Profile-based (Suganuma)
But challenges remain…
11
Specialization Overview
Interpret() {
  pc = oldpc+1;
  if (pc == 7)
  if (pc == 10)
  switch (instr[pc]) { … … }
}
[Diagram: specialized traces for pc == 7 and pc == 10, each containing an LD; callouts 1, 2, 3 refer to the questions below]
1. Where to specialize?
2. What heap values are constant?
3. When are assumed constants changed?
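The idea of dispatching on pc to a specialized trace can be sketched with a toy accumulator machine. Everything here is an assumption made for illustration: the opcodes (`ADD`, `MUL`, `DEC_JNZ`, `HALT`), the helper names, and the use of generated Python source in place of JIT-compiled code. The point is only that when pc equals a known value, the fetch and decode of `program[pc]` can be replaced by straight-line code.

```python
def step(program, pc, acc, counter):
    """Generic interpreter step: fetch, decode and execute program[pc]."""
    op, arg = program[pc]
    if op == "ADD":
        return pc + 1, acc + arg, counter, False
    if op == "MUL":
        return pc + 1, acc * arg, counter, False
    if op == "DEC_JNZ":  # counter -= 1; jump to arg while counter != 0
        counter -= 1
        return (arg if counter != 0 else pc + 1), acc, counter, False
    if op == "HALT":
        return pc, acc, counter, True
    raise ValueError(f"unknown opcode {op!r}")


def build_trace(program, start_pc):
    """Turn the straight-line ops starting at start_pc into one function.

    Assumes the program itself is constant, so opcodes and operands
    can be baked into the generated source.
    """
    lines = ["def trace(acc):"]
    pc = start_pc
    while program[pc][0] in ("ADD", "MUL"):
        op, arg = program[pc]
        lines.append(f"    acc = acc {'+' if op == 'ADD' else '*'} {arg!r}")
        pc += 1
    if pc == start_pc:
        raise ValueError("no straight-line code to specialize at this pc")
    lines.append("    return acc")
    namespace = {}
    exec("\n".join(lines), namespace)
    return namespace["trace"], pc  # pc is where the trace exits


def run(program, acc, counter, traces=None):
    """Interpreter loop whose dispatch selects a trace when pc == k."""
    traces = traces or {}
    pc = 0
    while True:
        if pc in traces:  # specialized path: no fetch or decode
            trace, exit_pc = traces[pc]
            acc, pc = trace(acc), exit_pc
            continue
        pc, acc, counter, done = step(program, pc, acc, counter)
        if done:
            return acc


program = [("ADD", 2), ("MUL", 3), ("DEC_JNZ", 0), ("HALT", None)]
traces = {0: build_trace(program, 0)}
```

Calling `run(program, 1, 2)` and `run(program, 1, 2, traces)` should give the same result; the second call executes the loop body through the trace for pc == 0.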
12
Existing solutions
What code to specialize?
  Current systems use annotations
  But annotations imprecise and barriers to acceptance
What heap values can we use as constants?
  Heap provides bulk of speedup (500% vs 5% without)
  Annotations: imprecise, not input-specific
How to invalidate optimistic assumptions?
  Optimism good for better specialization
  Current solutions unsound or untested
13
Our Solution: Dynamic Analysis
Precise: can specialize on
  This execution’s input
  Partially invariant data structures
Fast: online sample-based profiling has low overhead
Deployable: transparent, sits in a JIT compiler
  Just write your program in Java/C#
Simple to implement: let VM do the drudge work
  Code generation, profiling, constant propagation, recompilation, on-stack replacement
14
Algorithm
1. Find a specialization starting point e_pc
   e_pc = FindSpecPoint(hot_function)
2. Specialize: create a trace t(e_pc, k) for each hot value k
   Constant propagation, modified:
     Assume e_pc = k
     Eliminate loads from invariant memory locations
       Replace x := load loc with x = mem[loc] if Invariant(loc)
   Create a trace, not a CFG
     Loops unrolled, branch prediction for non-constant conditionals
     Eliminates safety checks, dynamic dispatch, etc. too
   Modify dispatch at pc to select trace t when e_pc = k
3. Invalidate
   Let S be the set of assumed invariant locations
   If Updated(loc) where loc ∈ S, invalidate
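Step 2 can be sketched as constant propagation over a tiny three-address IR, where a load from a location believed invariant is folded into a constant. The IR (`load` and `add` tuples), the function name `specialize`, and the example addresses are all hypothetical; the sketch handles only straight-line code and leaves out trace construction, loop unrolling, and branch prediction.

```python
def specialize(code, known, mem, invariant):
    """Constant-propagate `code`, folding loads from invariant locations.

    code      -- list of ("load", dst, addr) and ("add", dst, a, b) tuples;
                 operands are variable names (str) or int literals
    known     -- initial constants, e.g. {"pc": 7} for the assumption e_pc = k
    mem       -- snapshot of the heap at specialization time
    invariant -- set of addresses assumed never to change
    Returns (residual code, final constant environment, assumed locations S).
    """
    env = dict(known)
    residual = []
    assumed = set()

    def val(x):
        # Substitute a known constant for a variable where possible.
        return env.get(x, x) if isinstance(x, str) else x

    for ins in code:
        if ins[0] == "load":
            _, dst, addr = ins
            a = val(addr)
            if isinstance(a, int) and a in invariant:
                env[dst] = mem[a]      # x = mem[loc], load eliminated
                assumed.add(a)         # must be guarded by invalidation
            else:
                env.pop(dst, None)     # dst is no longer a known constant
                residual.append(("load", dst, a))
        elif ins[0] == "add":
            _, dst, x, y = ins
            x, y = val(x), val(y)
            if isinstance(x, int) and isinstance(y, int):
                env[dst] = x + y       # folded at specialization time
            else:
                env.pop(dst, None)
                residual.append(("add", dst, x, y))
        else:
            raise ValueError(f"unknown instruction {ins!r}")
    return residual, env, assumed


code = [
    ("load", "opcode", "pc"),      # opcode = instr[pc]
    ("add", "t", "opcode", "x"),   # depends on the variable input x
    ("add", "next_pc", "pc", 1),
]
residual, env, assumed = specialize(
    code, known={"pc": 7}, mem={7: 40, 100: 5}, invariant={7}
)
```

With `pc = 7` assumed, only the instruction that depends on the variable input `x` survives in `residual`; the returned `assumed` set plays the role of S in step 3.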
15
Solution 1: FindSpecPoint
Where to start a specialized trace?
  The best point can be near the end of the function
Ideally: try to specialize from all instructions
  Pick the best one
  But too slow for large functions
Local heuristics inconsistent, inaccurate
  Execution frequency, value hotness, CFG properties
Need an efficient global algorithm
  Should come up with a few good candidates
16
FindSpecPoint: Influence
If e_pc = k, how many dynamic instructions can we specialize away?
  Most precise: actually specialize
  Upper bound: forward dynamic slice of e_pc
  Too costly for an online environment
Our solution: Influence: upper bound of dynamic slice
  Dataflow-independent
  Def: Influence(e) = Expected number of dynamic instructions from the first occurrence of e_pc to the end of the function
  System of equations, solved in linear time
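One way to read the definition is as a system of equations over the control-flow graph: the expected number of instructions from a node to the function exit is 1 plus the probability-weighted value of its successors. The sketch below makes that reading concrete under several assumptions: one instruction per CFG node, edge probabilities taken from a profile, and a plain fixed-point iteration instead of the linear-time solver mentioned in the talk. The function name `influence` and the example graph are hypothetical.

```python
def influence(cfg, tol=1e-9, max_iters=10_000):
    """Expected number of dynamic instructions from each node to the exit.

    cfg maps a node to a list of (successor, probability) pairs; an exit
    node has an empty list. Solves
        I(n) = 1 + sum(p * I(s) for each successor s)
    by repeated substitution until the values stop changing.
    """
    result = {node: 0.0 for node in cfg}
    for _ in range(max_iters):
        delta = 0.0
        for node, succs in cfg.items():
            new = 1.0 + sum(p * result[s] for s, p in succs)
            delta = max(delta, abs(new - result[node]))
            result[node] = new
        if delta < tol:
            break
    return result


cfg = {
    "entry": [("then", 0.4), ("else", 0.6)],
    "then": [("loop", 1.0)],
    "else": [("exit", 1.0)],
    "loop": [("loop", 0.9), ("exit", 0.1)],   # loop body repeats 90% of the time
    "exit": [],
}
values = influence(cfg)
```

In this example the self-loop dominates: a node that leads into the loop gets a much larger value than one that goes straight to the exit, which matches the intuition that a trace started there would cover more dynamic instructions.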
17
Influence example
[Figure: control-flow graph annotated with influence values and branch probabilities]
Influence consistently selects the best specialization points
40%? 60%? Not quite…
1. Probability of ever reaching instruction
   How often will trace be executed?
2. Length of dynamic trace from instruction to end
   How much benefit obtainable?
Can approximate 1 and 2 by…
3. Expected trace length to end = Influence
18
Solution 2: Invariant(loc)
Primary issue: would like to know what memory locations are invariant
  Provides the bulk of the speedup
Existing work relied on static analysis or annotations
Our solution: sampled invariance profiling
  Track every nth store
  Locations detected as written: not constant
  Everything else: optimistically assumed constant
95.6% of claimed constants remained constant
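A minimal sketch of sampled invariance profiling, assuming stores are reported to a profiler object one at a time. The class name `InvarianceProfiler`, the counter-based sampling, and the location names are illustrative; a real VM would instrument compiled code rather than call a Python method.

```python
class InvarianceProfiler:
    """Records only every nth store; unrecorded locations count as constant."""

    def __init__(self, sample_every):
        self.sample_every = sample_every
        self.store_count = 0
        self.written = set()

    def on_store(self, loc):
        """Called on each store; only every nth call is actually tracked."""
        self.store_count += 1
        if self.store_count % self.sample_every == 0:
            self.written.add(loc)

    def is_invariant(self, loc):
        """Optimistic answer: constant unless a sampled store hit it."""
        return loc not in self.written


profiler = InvarianceProfiler(sample_every=10)
# "counter" is written constantly; "config" is written once, early on.
stores = ["counter"] * 4 + ["config"] + ["counter"] * 95
for loc in stores:
    profiler.on_store(loc)
```

Here the frequently written `counter` is caught by sampling, while the single write to `config` falls between samples, so `config` is still reported as invariant. That kind of miss is why the optimistic assumption has to be backed by invalidation.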
19
Profiling, cont’d
Use Arnold-Ryder duplication-based sampling to gather other useful info
CFG edge execution frequencies
  Helps identify good trace start points (influence)
Hot values at particular program points
  Helps seed the constant propagator with initial values
20
Solution 3: Invalidation
Our heap analysis is optimistic
  We need to guard assumed constant locations
  And invalidate corresponding traces
Our solution to the two key problems:
Detect when such a location is updated
  Use write barriers (type information eliminates most barriers)
  Overhead: ~6% << specialization benefit
Invalidate corresponding specialized traces
  A bit tricky: trace may need to be invalidated while executing
  See paper for our solution
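The guard-and-invalidate idea can be sketched as a heap wrapper whose store operation acts as a write barrier. The class `GuardedHeap` and its methods are hypothetical names. The sketch does not use type information to remove barriers, and it does not handle a trace that is invalidated while it is executing, which the talk defers to the paper.

```python
class GuardedHeap:
    """Heap with a write barrier that invalidates dependent traces."""

    def __init__(self, mem):
        self.mem = dict(mem)
        self.guards = {}           # location -> ids of traces that assumed it
        self.valid_traces = set()

    def register_trace(self, trace_id, assumed_locs):
        """Record the set S of locations a new trace assumed constant."""
        self.valid_traces.add(trace_id)
        for loc in assumed_locs:
            self.guards.setdefault(loc, set()).add(trace_id)

    def store(self, loc, value):
        """Write barrier: a store to a guarded location kills its traces."""
        dependents = self.guards.pop(loc, ())
        self.valid_traces.difference_update(dependents)
        self.mem[loc] = value

    def is_valid(self, trace_id):
        return trace_id in self.valid_traces


heap = GuardedHeap({7: 40, 8: 41, 100: 5})
heap.register_trace("t_pc7", assumed_locs={7, 8})
heap.register_trace("t_pc8", assumed_locs={8})
heap.store(100, 6)   # unguarded location: no trace is affected
heap.store(7, 0)     # guarded by t_pc7 only
```

After the two stores, `t_pc7` is no longer valid while `t_pc8` still is; a dispatcher would check `is_valid` before entering a trace and fall back to the unspecialized code otherwise.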
21
Experimental evaluation
Implemented in JikesRVM
Does the specializer work?
  Benchmarked real-world programs, existing specialization kernels
Is it suitable for a runtime environment?
  Benchmarked programs unsuitable for specialization
  Measured overheads
Does it exploit opportunities unavailable to other specializers?
  Looked at specific specializations for evidence
22
Results
Benchmark | Description | Input | Speedup
convolve | Transforms an image with a matrix; from the ImageJ toolkit | fixed image, various matrices | 2.74x
convolve | (as above) | fixed matrix, various images | 1.23x
dotproduct | Converted from C version in DyC | sparse constant vector | 5.17x
interpreter | Interprets simple bytecodes | bubblesort bytecodes | 5.96x
interpreter | (as above) | binary search bytecodes | 6.44x
jscheme | Interprets Scheme code | partial evaluator | 1.82x
query | Performs a database query; from DyC | semi-invariant query | 1.71x
sim8085 | Intel 8085 Microprocessor simulator | included sample program | 1.70x
em3d (intentionally unspecializable) | Electromagnetic wave propagation | -n 10000 -d 100 | 0.98x
23
Suitable for runtime environment?
Fully transparent
Low overheads, dwarfed by speedups
  Profiling overhead range: 0.1% - 19.8%
  Specialization time average: 0.7s
  Invalidation barrier overhead average: 4%
  See paper for extensive breakdown of overheads
Overhead on unspecializable programs < 6%
24
Runtime-only opportunities?
Convolve specialized in two different ways
  For two different inputs
Query specialized on partially invariant structure
Interpreter specialized on constant locations in interpreted program
  23% of dynamic loads from interpreted address space were constant; an additional 9.6% of all loads in interpreter’s execution were eliminated
  No distinction between address “spaces”
25
The end is the beginning (is the end)
I’ve presented a new specializer that
  Is totally transparent
  Exposes new specialization opportunities
  Is easy to throw into a JVM
26
Does the specializer work?
Similar speedups to existing specializers
  And on similar benchmarks
  With no annotations or offline phase
Ran on real-world programs
  Jscheme is a real interpreter
  Interpreting a 500-line partial evaluator (ha!)
27
Practical Specialization
We want the following properties:
  Automatically identify “constant” inputs
  Automatically identify specializable code
  Ensuring soundness if “constants” change
Some barriers to acceptance in the past
  Manual program annotations to specify constants
  Offline analysis
  Inefficient or incomplete soundness guarantees
28
Challenge 1: What code to specialize?
Requires programmer annotations (DyC, Tempo)
  Input not available at annotation time
  No transparency: involves the programmer
  A real roadblock to acceptance
… or offline annotation inference (Calpa)
  Input not available at inference time
  Abstraction in static analysis dilutes precision
  Too slow for JIT compilers
… or specialize the whole method (Suganuma)
29
Challenge 2: Heap constants
Which heap locations don’t change at run time?
  Annotations
  Static analysis
  Or greatly restrict heap usage (Suganuma)
Heap analysis is hard but very beneficial…
  5% speedup with Suganuma vs. 500% using full heap
30
Challenge 3: Invalidation
Can specialize better if optimistic:
  Assume that some memory locations don’t change
How to check invalidation of these assumptions?
  Programmer inserts invalidations
    Possibly unsound
  Pointer analysis
    Likely high overhead
No evaluation in the literature