Online partial evaluation of bytecodes (3)

Online partial evaluation of bytecodes (3) rei@415

Run-time Optimization
Overview of Specialization
The DyC System
The Dynamo System
Run-time Specialization for ML

Specialization
[Diagram: Program takes Input1 and Input2 and produces Output. The Specializer takes Program and Input1 and produces ProgramInput1, which takes Input2 and produces Output.]
ProgramInput1 is a specialized program
Specializer performs specialization of Program with respect to Input1
Example: specializing add with respect to 1 yields succ, and succ applied to 2 gives 3
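The property pictured on this slide can be written as a pair of equations. This is a sketch in the usual partial-evaluation notation, which the slides themselves do not use: the double brackets (an assumption of this note) stand for "the function computed by".

```latex
\[
  [\![\mathit{Specializer}]\!](\mathit{Program}, \mathit{Input1}) = \mathit{Program}_{\mathit{Input1}}
\]
\[
  [\![\mathit{Program}_{\mathit{Input1}}]\!](\mathit{Input2})
  = [\![\mathit{Program}]\!](\mathit{Input1}, \mathit{Input2})
  = \mathit{Output}
\]
% Instance from the slide: specializing add with respect to 1 gives succ.
\[
  [\![\mathit{add}]\!](1, 2) = [\![\mathit{succ}]\!](2) = 3
\]
```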

Flow-Chart Language
n m (init)
init: result := 1;
      goto test;
test: if n < 1 then end else loop
loop: result := result * m;
      n := n - 1;
      goto test;
end:  return result;
[Diagram: inputs 5 and 2, output 5^2.]

Flow-Chart Language
[Diagram, built up over three slides: the same program with one input left unknown (?) and the other fixed to 2 computes ?^2. Given the program and the known input 2, the Specializer produces Program2, which maps ? to ?^2.]

Flow-Chart Language
Program2 (maps ? to ?^2)
m (init)
init:  goto test0;
test0: goto loop0;
loop0: result := 1 * m;
       goto test1;
test1: goto loop1;
loop1: result := result * m;
       goto test2;
test2: goto end;
end:   return result;
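A minimal Python sketch of how Program2 can be obtained: the flow chart is walked with n known and m unknown, the test on n is decided at specialization time, and only the multiplications by m are emitted. The function name `specialize_power` and the one-line-per-block text format are assumptions made for this illustration, not part of the slides.

```python
def specialize_power(n):
    """Specialize the power flow chart w.r.t. a known n; m stays unknown."""
    lines = ["init: goto test0;"]
    result_is_static = True   # result is still the known constant 1
    k = 0                     # index of the current residual test/loop pair
    while n >= 1:             # "if n < 1" is decided now, not in the residual code
        lines.append(f"test{k}: goto loop{k};")
        lhs = "1" if result_is_static else "result"
        lines.append(f"loop{k}: result := {lhs} * m; goto test{k + 1};")
        result_is_static = False
        n -= 1                # "n := n - 1" is executed now
        k += 1
    lines.append(f"test{k}: goto end;")
    # If the loop never ran, result is still the static 1 and can be returned directly.
    lines.append("end: return " + ("1;" if result_is_static else "result;"))
    return lines


program2 = specialize_power(2)
print("\n".join(program2))
```

For n = 2 the emitted blocks coincide with the Program2 listing on the slide.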

What is Inside the Specializer?
A specializer is a program processor
Originally seen as a static source-to-source transformation
Two families of specializers:
–Online specialization (one pass)
–Offline specialization (two passes)

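[Diagram, Online Specialization: Program and Input1 go to the Specializer, which produces ProgramInput1 (the specialized program); ProgramInput1 maps Input2 to Output.]
[Diagram, Offline Specialization: Program is first analysed, knowing only that it is Input1 that will be given first, which yields an Annotated Program; the Specializer then takes the Annotated Program and Input1 and produces ProgramInput1 (the specialized program), which maps Input2 to Output.]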

DyC’s Run-time Specialization
It is specialization that occurs at run-time
We define run-time by the availability of the input
The optimization entails an obvious cost/performance trade-off
Typical situation expected for a pair of variables:
–The static variable is constant and appears earlier, and
–The dynamic variable appears later and is not constant
The DyC system is a run-time specialization system with offline specialization of FCL

[Diagram: the Online Specialization and Offline Specialization schemes again, now with a Run-time Specializer in place of the Specializer and a Run-time label marking the part of each scheme that executes at run-time. In the offline scheme, Analysis still produces the Annotated Program knowing only that it is Input1 that will be given first.]

[Diagram, Offline Specialization: Analysis turns Program into an Annotated Program, knowing that it is Input1 that will be given first; the Run-time Specializer turns the Annotated Program and Input1 into ProgramInput1 (the specialized program), which maps Input2 to Output.]
[Diagram, DyC (Generating Extension): Analysis turns Program into an Annotated Program in the same way; Cogen turns the Annotated Program into a Generating Extension (custom specializer), which at run-time turns Input1 into ProgramInput1 (the specialized program), which maps Input2 to Output.]
Generating Extension: the specialized version of the specializer w.r.t. the annotated program

‘It’s Input1 that will be given first’
make_static (n)    -- specialize program w.r.t. n if not in cache
m (init)
init: result := 1;
      goto test;
test: if n < 1 then end else loop
loop: result := result * m;
      n := n - 1;
      goto test;
end:  return result;
[Diagram, built up over four slides: at run-time, with inputs 5 and 2 and output 5^2, the annotated program is specialized on demand, and the Cache fills with the specialized versions Program4, Program3 and Program2.]
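The "specialize w.r.t. n if not in cache" step can be sketched in Python as a dictionary keyed by the static value. The names `cached_specializer` and `specialize_power` are hypothetical, and the code generator is a stand-in that builds a Python lambda rather than native code; this is not DyC's actual mechanism.

```python
def cached_specializer(specialize):
    """Wrap a specializer so that each static value is specialized only once."""
    cache = {}

    def run(static_value, dynamic_value):
        if static_value not in cache:  # specialize w.r.t. n if not in cache
            cache[static_value] = specialize(static_value)
        return cache[static_value](dynamic_value)

    return run, cache


def specialize_power(n):
    # Stand-in for code generation: unroll the loop into an expression over m.
    body = " * ".join(["1"] + ["m"] * n)
    return eval(f"lambda m: {body}")


power, cache = cached_specializer(specialize_power)
results = [power(2, 5), power(3, 5), power(4, 5), power(2, 7)]
print(results, sorted(cache))
```

The fourth call reuses the version already cached for n = 2, so the cache ends up holding three specialized programs, as in the diagram.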

Annotation-directed Run-time Optimization
The underlying language is a version of the Flow-Chart Language
Annotations help avoid non-termination and unneeded specialization:
–Eager: aggressive speculative specialization
–Lazy: demand-driven specialization
Annotations guide cache policy:
–CacheAllUnchecked variable: disposable code
–CacheOne variable: the current version is cached

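[Diagram, DyC (Generating Extension): Analysis turns the Program, annotated with make_static (…), into an Annotated Program; Cogen turns it into a Generating Extension (custom specializer), which at run-time turns Input1 into ProgramInput1 (the specialized program), which maps Input2 to Output. Representations named along the pipeline: Annotated C, Annotated Intermediate Representation, Native Code.]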

The Dynamo System
A 6-year-long project
Transparent operation (custom crt0.o)
Dynamo is a PA-8000 code interpreter
Assumption: “Most of the time is spent in a small portion of the code”
Performance opportunities:
–Redundancies that cross program boundaries
–Cache utilization
It interprets until a trace is detected, for which it generates a fragment

Most Recently Executed Tail
A trace is delimited by start-of-trace and end-of-trace conditions
Start-of-trace condition:
–Target of backward-taken branches (loop header)
–Taken branches from fragment code (exit)
End-of-trace condition:
–Backward-taken branches (only loops whose header is the start-of-trace are allowed to appear in the trace)
–Taken branches to fragment code (entry)
Each branch is associated with a counter

[Flow chart, built up over four slides: Dynamo's control loop over the native instruction stream]
Run-time Interpreter:
–Interpret until taken branch
–Lookup branch target in cache
 hit: context switch, jump to top of fragment in the Fragment cache (Native Code)
 miss: test the start-of-trace condition
–Start-of-trace condition? no: back to interpreting; yes: increment the counter associated with the branch target addr
–Counter value exceeds hot threshold? no: back to interpreting; yes: interpret+codegen until taken branch
–End-of-trace condition? no: keep on with interpret+codegen; yes: go to the Optimizer
Optimizer (Intermediate Representation):
–Create new fragment and optimize it
–Emit into cache, link with other fragments & recycle the associated counter
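The counter part of this flow chart can be sketched in Python. The sketch makes several simplifying assumptions: a branch is a (source address, target address) pair, "backward" means the target address is not greater than the source address, only the first start-of-trace condition is modelled, and creating a fragment is reduced to adding its start address to a set. The name `hot_trace_heads` and the threshold value are illustrative.

```python
from collections import Counter

HOT_THRESHOLD = 50  # assumed value, for illustration only


def hot_trace_heads(taken_branches, threshold=HOT_THRESHOLD):
    """Return the branch targets that become hot, in the order they do."""
    counters = Counter()      # one counter per start-of-trace candidate
    fragment_cache = set()    # start addresses that already have a fragment
    hot = []
    for source, target in taken_branches:
        if target in fragment_cache:
            continue                      # hit: execution would jump to the fragment
        if target <= source:              # backward-taken branch: start-of-trace
            counters[target] += 1
            if counters[target] > threshold:
                fragment_cache.add(target)   # a fragment would be created here
                del counters[target]         # recycle the associated counter
                hot.append(target)
    return hot


# A loop at address 10 whose back edge at address 20 is taken five times.
heads = hot_trace_heads([(20, 10)] * 5, threshold=3)
print(heads)
```

Forward branches never touch a counter in this sketch, which mirrors the "no" edge out of the start-of-trace test.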

Dynamo: Notes
The total overhead is less than 1.5%
The contribution of the optimization part is negligible
The average overall speedup is about 9%
Dynamic branches are treated lazily:
–We may loop to the start-of-trace
–If the target is in the cache, we’re done
–If not, we return to the interpreter
The optimizer actually performs a partial evaluation similar to DyC’s

Run-time Specialization for ML
The ML virtual machine features partial application by means of closures
We may see currying as an annotation that guides run-time specialization
By use of a pe annotation, we would like to perform on-demand specialization:
–merge list1 list2
–(pe merge list1) list2
In the context of a virtual machine, specialization is a bytecode-to-bytecode transformation

[Diagram, built up over three slides: Online Specialization + JIT compared with DyC (Generating Extension)]
Online Strategy (Online Specialization + JIT): at run-time, triggered by pe, the Run-time Specializer turns Program and Input1 into ProgramInput1 (the specialized program, ML bytecode); the JIT turns it into native code, which maps Input2 to Output. Slide labels: Standard Compilation, Portability.
Offline Strategy (DyC, Generating Extension): Analysis turns the Program, annotated with make_static (…), into an Annotated Program; Cogen turns it into a Generating Extension (custom specializer), which at run-time turns Input1 into ProgramInput1 (the specialized program), which maps Input2 to Output. Slide labels: Non-Standard Compilation, Portability.
A further label on the final slide: Reusable.

Interpreter
switch (*code_ptr) {
  case ACC: code_ptr++; accu = stack[*code_ptr++]; break;  /* skip opcode, read operand */
  case POP: code_ptr++; stack += *code_ptr++; break;
  …
Threaded Interpreter
void* array[] = {&&lbl_ACC, &&lbl_POP, …};  /* GCC labels-as-values */
lbl_ACC: code_ptr++; accu = stack[*code_ptr++]; goto *array[*code_ptr];
lbl_POP: code_ptr++; stack += *code_ptr++; goto *array[*code_ptr];
…
Just-in-time Compilation

Just-in-time compilation is the natural step following threaded code
Also involves inlining of the virtual machine
Implemented via GCC’s asm instruction
ACC arg = 0x8B 0x44 0x24 0x4*arg (movl 4*arg(%esp,1),%eax)
POP arg = 0x59 0x59 … (popl %ecx; popl %ecx; …)
CONSTINT arg = 0xB8 0xarg 0x00 0x00 0x00 0xD1 0xE0 0x40 (movl $arg,%eax; shl %eax; inc %eax)
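As a rough illustration, the three templates can be written as Python functions that return the machine-code bytes for one bytecode instruction. The function names are hypothetical. Two encoding details are stated here as x86 facts rather than taken from the slide: 0xB8 is the opcode for movl $imm32,%eax, and the 32-bit immediate is stored little-endian. The shl/inc pair builds 2*arg+1, the tagged form ML virtual machines in the OCaml family use for integers.

```python
import struct


def emit_acc(arg):
    """movl 4*arg(%esp,1),%eax with an 8-bit displacement."""
    if not 0 <= 4 * arg <= 0x7F:
        raise ValueError("4*arg must fit in a signed 8-bit displacement")
    return bytes([0x8B, 0x44, 0x24, 4 * arg])


def emit_pop(arg):
    """arg copies of popl %ecx."""
    return bytes([0x59]) * arg


def emit_constint(arg):
    """movl $arg,%eax; shl %eax; inc %eax (leaves 2*arg+1 in %eax)."""
    return b"\xB8" + struct.pack("<i", arg) + b"\xD1\xE0\x40"


# Native code for the bytecode sequence CONSTINT 3; POP 2.
code = emit_constint(3) + emit_pop(2)
print(code.hex(" "))
```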

Specialization Algorithm
Implemented as an interpreter/[JIT-]compiler
Performs aggressive:
–Constant propagation
–Unfolding of [recursive] function calls
Manipulates symbolic values

Mixed Computation
[Diagram: the Specializer takes a Subject Program and a Context (stack slots 0: and 1:) and produces a Specialized Program, interpreting static expressions and residualizing dynamic expressions.]
Subject Program: ACC 1; PUSH; ACC 1; ADDINT
Specialized Programs, one per Context shown:
–CONSTINT 7
–ACC 0; PUSH; CONSTINT 3; ADDINT
–ACC 0; PUSH; CONSTINT 4; PUSH; ACC 1; ADDINT; POP 1
–ACC 1; PUSH; ACC 1; ADDINT
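A minimal Python sketch of mixed computation on a four-instruction toy machine. Everything here is an assumption made for illustration: the `DYN` marker, the symbolic values `("slot", i)` and `("add", a, b)`, and the residualization scheme, which may order instructions differently from the specialized programs on the slide. The residual code is assumed to run on a stack with the same layout as the context.

```python
DYN = object()  # marks a stack slot whose value is unknown at specialization time


def step(op, args, accu, stack, add):
    """One instruction of the toy machine; stack[0] is the top of the stack."""
    if op == "ACC":
        return stack[args[0]]
    if op == "PUSH":
        stack.insert(0, accu)
        return accu
    if op == "CONSTINT":
        return args[0]
    if op == "ADDINT":
        return add(accu, stack.pop(0))
    raise ValueError(f"unsupported instruction: {op}")


def run(code, stack):
    """Ordinary interpreter: every value is a known integer."""
    stack, accu = list(stack), 0
    for op, *args in code:
        accu = step(op, args, accu, stack, lambda a, b: a + b)
    return accu


def mixed_add(a, b):
    """Interpret when both operands are static, otherwise build a symbolic value."""
    if isinstance(a, int) and isinstance(b, int):
        return a + b
    return ("add", a, b)


def residualize(value, depth=0):
    """Emit code computing a symbolic value; depth counts pushes made so far."""
    if isinstance(value, int):
        return [("CONSTINT", value)]
    if value[0] == "slot":
        return [("ACC", value[1] + depth)]
    _, a, b = value
    return residualize(b, depth) + [("PUSH",)] + residualize(a, depth + 1) + [("ADDINT",)]


def specialize(code, context):
    stack = [("slot", i) if v is DYN else v for i, v in enumerate(context)]
    accu = 0
    for op, *args in code:
        accu = step(op, args, accu, stack, mixed_add)
    return residualize(accu)


SUBJECT = [("ACC", 1), ("PUSH",), ("ACC", 1), ("ADDINT",)]
print(specialize(SUBJECT, [3, 4]))      # fully static context
print(specialize(SUBJECT, [DYN, 3]))    # slot 0 dynamic
```

With a fully static context the whole program is interpreted away into one CONSTINT; with a fully dynamic context the sketch gives back the subject program unchanged.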

Program Point Specialization
We define a program point as either:
–the entry of the program,
–the else-branch of a dynamic conditional branch,
–the application of a recursive function
We define a context as the state of the stack when we enter a program point
Each specialized program point is associated with a residual code
We maintain a cache of recursive function applications together with their contexts

Algorithm
Specializer
Input: A program point and a context
Output: The corresponding specialized program
processed := {}
pending := {pp0, cx0}
while pending != {} do
  {pp, cx} := one element of pending
  if {pp, cx} notin processed then
    specialize pp w.r.t. cx
    processed := processed + {pp, cx}
  pending := pending – {pp, cx}
arrange processed
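The worklist loop can be sketched in Python as follows. Two things are assumptions of this sketch rather than statements from the slide: the `specialize` callback returns both the residual code and the (program point, context) pairs it discovered, and contexts are hashable. The final "arrange processed" step is left out. `toy_specialize` is a made-up stand-in that walks the power flow chart with n as the context.

```python
def specialize_program(pp0, cx0, specialize):
    """Specialize every reachable (program point, context) pair exactly once."""
    processed = {}                 # (pp, cx) -> residual code
    pending = [(pp0, cx0)]
    while pending:
        pp, cx = pending.pop()     # one element of pending, removed from it
        if (pp, cx) in processed:
            continue               # already specialized: this is what ensures termination
        residual, successors = specialize(pp, cx)
        processed[(pp, cx)] = residual
        pending.extend(successors)
    return processed


def toy_specialize(pp, n):
    """Hypothetical stand-in: returns (residual code, newly reachable pairs)."""
    if pp == "test":
        if n < 1:
            return "goto end;", []
        return f"goto loop{n};", [("loop", n)]
    return f"result := result * m; goto test{n - 1};", [("test", n - 1)]


residual = specialize_program("test", 2, toy_specialize)
for key in sorted(residual):
    print(key, residual[key])
```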

Notes
Non-termination is avoided by means of the cache of recursive function applications
A BRANCHEND bytecode is added to make lifting easier in a dynamic context
Branches with side-effects are ruled dynamic
The actual work is performed by the specialize function

Conclusion
We’ve seen two recent run-time optimizing systems (PLDI’99 & PLDI’00)
DyC performs accurate optimizations thanks to:
–advanced specialization
–programmer annotations
Dynamo is faster than native code execution:
–despite the interpretive overhead
–while staying completely transparent
In a similar vein, we would like to:
–speed up ML interpretation
–while retaining portability
