Fully Dynamic Specialization AJ Shankar OSQ Lunch 9 December 2003

“That’s Why They Play the Game”
- Programs are executed because we can’t determine their behavior statically!
- Idea: optimize programs dynamically to take advantage of runtime information we can’t get statically
- Look at portions of the program for predictable inputs that we can optimize for

Specialization
- Recompile portions of the program, using known runtime values as constants
  - Possibly many variants of the same code
  - Allow for fallback to original code when assumptions are not met
- Predictable == recurrent
[Diagram: a generic version G plus specialized variants P2, P3, P4, selected by whether the input is predictable or unpredictable]

How It Works
- Choose a good region of code to specialize: after a good predictable instruction
- Insert a dispatch that checks the result of the chosen instruction
- Recompile the code for different results of the instruction
- During execution, jump to the appropriate specialized code
[Diagram: after the trigger "LOAD pc / X = …", a Dispatch(X) block selects among Spec1, Spec2, and Default, then falls through to the rest of the code]
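In code, the transformation might look like the following sketch (a minimal illustration, not the talk's implementation; the trigger value x, the hot values 0 and 1, and the computation x * 2 + 1 are all invented for the example):

class DispatchSketch {
  // Variant compiled under the assumption x == 0: uses of x fold away.
  static int specializedFor0() { return 0 * 2 + 1; }

  // Variant compiled under the assumption x == 1.
  static int specializedFor1() { return 1 * 2 + 1; }

  // Fallback: the original, unspecialized code.
  static int generic(int x) { return x * 2 + 1; }

  // The dispatch inserted right after the trigger instruction.
  static int dispatch(int x) {
    switch (x) {
      case 0:  return specializedFor0();
      case 1:  return specializedFor1();
      default: return generic(x);  // assumptions not met: fall back
    }
  }
}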

Tying Things Together
- If Foo is specialized on X
- And because of X, Y is constant
- And Foo calls Bar with param Y
- And Bar is specialized on Y
- Then Foo can jump straight to that specialized version of Bar
[Diagram: method Foo’s dispatch selects Spec_X, which calls Bar(Y); the call lands directly on Spec_Y inside method Bar, bypassing Bar’s dispatch]
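A hedged sketch of this cross-procedure linking, with invented constants (X == 7, Y == 42):

class LinkingSketch {
  // Bar's dispatch, consulted only by unspecialized callers.
  static int bar(int y) {
    if (y == 42) return barSpecializedY42();
    return barGeneric(y);
  }
  static int barSpecializedY42() { return 42 + 1; }  // y folded to 42
  static int barGeneric(int y)   { return y + 1; }

  // Foo's variant specialized on X == 7: Y = X * 6 = 42 is a known
  // constant, so the call binds directly to Bar's specialized version,
  // skipping Bar's dispatch entirely.
  static int fooSpecializedX7() {
    return barSpecializedY42();
  }
}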

When Is This a Good Idea?
- Any app whose execution is heavily dependent on its input, for instance:
  - Interpreters
  - Raytracers
  - Dynamic content producers (CGI scripts, etc.)

Specialization Is Hard!
- Specializing code at runtime is costly
  - Can even slow the program down
- Existing specializers rely on static annotations to clue them in about profitable areas
  - Difficult to get right
  - Limits specialization potential

Existing: DyC, Cyclone, etc.
- Explicitly annotate static data
- No support for automatic specialization of frequently executed code
  - Could compile lots of useless stuff
- No concrete store information
  - Doesn’t take advantage of the fact that memory location X is constant for the lifetime of the program

Existing: Calpa
- Mock et al.; an extension to DyC
- Profiles execution on a sample input to derive annotations
- But converting a concrete profile to an abstract annotation means:
  - Still unable to detect concrete memory constants
  - Is code that was frequently executed on the sample input frequent for arbitrary input?
- Still needs source, and is offline!

Motivating Example: Interpreter

while (1) {
  i = instrs[pc];
  switch (i.opcode) {
  case ADD:
    env[i.res] = env[i.op1] + env[i.op2];
    pc++;
    break;
  case BNEQ:
    if (env[i.op1] != 0) pc = i.op2;
    else pc++;
    break;
  ...
  }
}

Sample interpreted program:

X = 10;
…
WHILE (Z != 0) {
  Y = X+Z;
  …
}

Notes from the slide: X is constant after initialization (a concrete memory location); Y = X+Z is executed frequently.

Motivating Example: Interpreter (continued)

Sample interpreted program:

X = 10;
…
WHILE (Z != 0) {
  Y = X+Z;
  …
}

Specialized interpreter:

while (1) {
  if (pc == 15) {
    // Y = X + Z, with X constant-folded to 10
    env[3] = 10 + env[2];
    …
    // Z != 0 ?
    if (env[2] == 0)
      pc = 19;   // fall out of the specialized code
  } else {
    // normal interpreter loop
  }
}

A More Concrete Approach
- Do everything at runtime!
- Specialize on execution-time hot values
- Know which concrete memory locations are constant
- Other benefits of this approach:
  - Specialize temporally, as execution progresses
  - Specialize dynamically loaded libraries as well
  - No annotations or source code necessary

A Quick Recap
- Choose a good region of code to specialize
- Insert a dispatch that checks the result of the chosen instruction (the “trigger”)
- Recompile the code for different values of a hot instruction
- During execution, jump to the appropriate specialized code
[Diagram: for the interpreter, Dispatch(pc) selects among the pc=15 and pc=27 variants and the generic while(1) loop]

The Details
- Need to identify the best predictable instruction
  - Specializing on its result should provide the greatest benefit
  - To find it, gather profile information about all instructions
- Need to actually do the specializing

Instrumentation: Hot Values
- What’s a hot value? One that occurs frequently as the result of an instruction
  - x % 2 has two very hot values, 0 and 1
- Good candidate instructions are predictable: they result in (only) a few hot values
  - For instance, small_constant_table[x], but not rand(x)
- Case study: Interpreter
  - Predictable instructions: LOAD pc, instr.opcode

    instr = instrs[pc];
    switch (instr.opcode) { … }
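One plausible shape for this instrumentation, as a sketch (the instruction ids, thresholds, and class names are assumptions, not the talk's design):

import java.util.HashMap;
import java.util.Map;

class ValueProfile {
  // instruction id -> (observed result value -> occurrence count)
  private final Map<Integer, Map<Long, Integer>> counts = new HashMap<>();

  // Called by instrumentation each time instruction `instr` produces `value`.
  void record(int instr, long value) {
    counts.computeIfAbsent(instr, k -> new HashMap<>())
          .merge(value, 1, Integer::sum);
  }

  // Predictable = a few hot values cover most executions,
  // e.g. x % 2 (two hot values) but not rand(x).
  boolean isPredictable(int instr, int maxHotValues, double minCoverage) {
    Map<Long, Integer> vs = counts.get(instr);
    if (vs == null || vs.isEmpty()) return false;
    long total = 0;
    for (int c : vs.values()) total += c;
    long topK = vs.values().stream()
                  .sorted((a, b) -> Integer.compare(b, a))
                  .limit(maxHotValues)
                  .mapToLong(Integer::longValue)
                  .sum();
    return (double) topK / total >= minCoverage;
  }
}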

Instrumentation: Store Profile
- Keep track of memory locations that have been written to
- Idea: if a location hasn’t been written to yet, it probably won’t be later, either
- Case study: Interpreter
  - The store profile says env[Y] is written to a lot, but env[X] and instrs[] are never written to

    regs[instr.res] = regs[instr.op1] + regs[instr.op2];
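A matching sketch for the store profile (abstract addresses and the zero-write test are illustrative assumptions):

import java.util.HashMap;
import java.util.Map;

class StoreProfile {
  // abstract memory address -> number of writes observed so far
  private final Map<Long, Integer> writeCounts = new HashMap<>();

  // Called by instrumentation on every store.
  void recordWrite(long address) {
    writeCounts.merge(address, 1, Integer::sum);
  }

  // The slide's heuristic: a location with no writes so far probably
  // won't be written later, so treating its contents as constant is
  // likely profitable (subject to invalidation if the guess is wrong).
  boolean likelyInvariant(long address) {
    return writeCounts.getOrDefault(address, 0) == 0;
  }
}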

Invalidating Specialized Code
- Memory locations may not really be constant
- When ‘constant’ memory is overwritten, we must invalidate or modify the specializations that depended on it
- How does Calpa handle invalidation?
  - Computes a points-to set
  - Inserts invalidation calls at all appropriate points (offline)
  - Too costly an approach, without modification

Invalidation Options
- Write barrier
  - Still feasible if the field is private
- On-entry checks
  - Feasible if the specialization depends on a small number of memory locations
  - e.g. Factor(BigInt x)
- Hardware support
  - e.g. Mondrian
  - Ideal solution
  - Possible to simulate?

class Interpreter {
  private Instruction[] instrs;
  void SetInstrs(Instruction[] is) { instrs = is; }
}

[Diagram: a dispatch selects Spec1 or Default at the hot instruction; a CheckMem step invalidates the specialization when the memory it depends on changes]
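For the write-barrier option, here is a sketch built on the slide's Interpreter example: because instrs is private, every write goes through SetInstrs, so a single invalidation call there suffices. SpecializationRuntime is a hypothetical runtime hook, not an existing API:

class Instruction { int opcode, op1, op2, res; }

class SpecializationRuntime {
  // Hypothetical hook: discard specialized code that assumed the
  // named field's contents were constant.
  static void invalidate(Object owner, String field) { /* ... */ }
}

class Interpreter {
  private Instruction[] instrs;

  void SetInstrs(Instruction[] is) {
    instrs = is;
    // Write barrier: future executions fall back to generic code
    // until new specializations are compiled.
    SpecializationRuntime.invalidate(this, "instrs");
  }
}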

Specialization Procedure
- Recap: we know which instructions are good candidates, what their hot values are, and what parts of memory are likely to be invariant
- Want to compile different versions of the same block of code relative to a chosen trigger instruction
- Each version is keyed on a hot value of that instruction
- What instruction, if any, should be a basis for specialization?

Specialization Algorithm
1. Find good candidate instructions
   - Predictable
   - Frequently executed
2. For each candidate instruction
   - Simultaneously evaluate the method using constant propagation for some of its hot values
   - Compute the overall cost/benefit
3. Choose the best instruction

Algorithm Pseudo-code

foreach (value v in hot_values)
  worklist.push(<entry, state_v>);
previously_emitted = [];

while (<n, state> = worklist.pop()) {
  <n', state'> = evaluate(<n, state>);  // uses store information, fixes jumps
  foreach (n'' in succ(n')) {
    // have we already seen this node/state pair before?
    prev_instr = previously_emitted[<n'', state'>];
    if (prev_instr) {                   // if so, link to it
      n'.modify_jump_to(n'' -> prev_instr);
    } else {                            // otherwise, keep evaluating
      worklist.push(<n'', state'>);
    }
  }
  instr = emit_instruction(n');
  // remember this pair in case we see it again
  previously_emitted[<n, state>] = instr;
}

Specializing the Interpreter

while (1) {
  i = instrs[pc];
  switch (i.opcode) {
  case ADD:
    env[i.res] = env[i.op1] + env[i.op2];
    pc++;
    break;
  case BNEQ:
    if (env[i.op1] != 0) pc = i.op2;
    else pc++;
    break;
  ...
  }
}

Candidates:
- instr.opcode: executed very frequently; a small handful of values
- pc: executed very frequently; more values, but still reasonable

Specializing on instr.opcode

Original loop:

LOOP: i = instrs[pc]
      switch (i.opcode)
      case ADD:
        env[i.res] = env[i.op1] + env[i.op2]
        pc = pc + 1
        goto LOOP

With i.opcode = ADD known, Dispatch(opcode) jumps straight to a variant with the switch folded away:

LOOP: i = instrs[pc]
      env[i.res] = env[i.op1] + env[i.op2]
      pc = pc + 1
      goto LOOP

The slide's running benefit counter reaches 3 for this variant. Other values of opcode have similar results…

Specializing on pc

With pc = 15 known, evaluation unrolls the interpreted program itself:

pc = 15 ; i = ADD Y, X, Z:
      i = instrs[15]            // folds away
      switch (ADD), case ADD    // folds away
      env[Y] = 10 + env[Z]      // Y = X + Z, with X constant-folded to 10
      pc = 16                   // pc still known

pc = 16 ; i = BNEQ Z, 15:
      i = instrs[16]            // folds away
      switch (BNEQ)             // folds away
      if (env[Z] != 0) pc = 15  // loops back to a known state
      else pc++
      …

The running benefit counter keeps climbing (1, 2, 3, … 10, …) because each unrolled instruction folds away the fetch and dispatch. Dispatch(pc) selects this unrolled variant.

Final Result
- Choose to specialize on pc, because the benefit is far greater than for instr.opcode
- Generate different versions for each of the hottest values of pc
- Terminate loop unrolling either naturally (when we don’t know what pc is anymore) or with a simple heuristic

Heuristics
- Algorithm may not terminate when unrolling loops
  - Simple heuristic: widen variables when we’ve seen the same node, say, 10 times (or use frequency statistics)
- Algorithm may generate lots of code
  - Need to only look at the parts of the state that matter
  - Widen somewhere…
- Other issues: algorithm may be slow
  - Need a better way to prune off bad candidates
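The node-count heuristic could be as simple as this sketch (the threshold of 10 is the slide's suggestion; identifying nodes by an integer id is an assumption):

import java.util.HashMap;
import java.util.Map;

class WideningHeuristic {
  private static final int LIMIT = 10;
  // node id -> number of times the specializer has evaluated it
  private final Map<Integer, Integer> visits = new HashMap<>();

  // Returns true once a node recurs too often; the caller should then
  // widen its abstract state (drop specialized variables to "unknown")
  // so that loop unrolling terminates.
  boolean shouldWiden(int nodeId) {
    return visits.merge(nodeId, 1, Integer::sum) > LIMIT;
  }
}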

Implementation Ideas
- Use Dynamo
  - Hot trace as the basis for specialization
  - Intuitively, follow the lifetime of an object as it travels through the program, across function boundaries
  - Unfortunately, closed-source, and the API isn’t expressive enough

Implementation Ideas
- JikesRVM
  - Java VM written in Java
  - Has a primitive framework for sampling
  - Has a fairly sophisticated framework for dynamic recompilation
  - Does aggressive inlining
  - Only instrument hot traces (but the compiler is slow…)