Runtime Optimization with Specialization. Johnathon Jamison, CS265 (Susan Graham), 4-30-2003.



What is Runtime Code Generation (RTCG)?
– Dynamic addition of code to the instruction stream
– Restricted to instructions executed directly by the hardware

Problems with RTCG
– Reentrant code
– Portability (high-level languages vs. assembly)
– Data vs. code issues: caches and memory; standard compilation schema
– Maintainability and understandability

Benefits of RTCG
– Adaptation to the architecture (cache sizes, various latencies, etc.)
– JIT compilation
– No profiling needed (the actual data is available)
– Runtime literals enable optimizations unknown or impossible at compile time
– Potentially more compact code (better for caches)

Dynamic Compilation Trade-offs
– Execution time is linear in run count
– Choice between lower startup cost and lower incremental cost
[Chart: execution time vs. amount of input for unoptimized static, optimized static, and optimized dynamic code; dynamic code has the highest startup cost but the lowest per-run cost]

Observation
– Programmers write the common case: blit routines, image display
– Applications have repetitious data: simulators, regexp matching
– Optimizations: sparse matrices

One Tack: Specialization
– Take a piece of code and replace variables with constants
– Enables various optimizations: strength reduction, constant propagation, etc.
– Generate explicitly or implicitly
– Possibly reuse

Example Of Specialization

int dot_product(int size, int u[], int v[]) {
    int res = 0;
    for (int i = 0; i < size; i++) {
        res = res + u[i] * v[i];
    }
    return res;
}

Suppose size == 5, u == {14, 0, 38, 12, 1}

Example Of Specialization

int dot_product_1(int v[]) {
    int res = 0;
    for (int i = 0; i < 5; i++) {
        res = res + {14,0,38,12,1}[i] * v[i];
    }
    return res;
}

Substitute in the values (the inline array literal is for illustration; it is not legal C)

Example Of Specialization

int dot_product_1(int v[]) {
    int res = 0;
    res = res + 14 * v[0];
    res = res + 0 * v[1];
    res = res + 38 * v[2];
    res = res + 12 * v[3];
    res = res + 1 * v[4];
    return res;
}

Unroll the loop

Example Of Specialization

int dot_product_1(int v[]) {
    int res;
    res = 14 * v[0];
    res = res + 38 * v[2];
    res = res + 12 * v[3];
    res = res + v[4];
    return res;
}

Eliminate unneeded code

DyC
– The make_static annotation indicates which variables to specialize with respect to
– A second annotation marks static loads (a reload is not needed)

int dot_product(int size, int u[], int v[]) {
    make_static(size, u);
    int res = 0;
    for (int i = 0; i < size; i++) {
        res = res + u[i] * v[i];
    }
    return res;
}

DyC Specializer
– Each region has a runtime specializer
– Setup computations are run
– The values are plugged into holes in code templates
– The resultant code is optimized
– The result is translated to machine code and run

DyC Optimizations
– Polyvariant specialization
– Internal dynamic-to-static promotions
– Unchecked dispatching
– Complete loop unrolling
– Polyvariant division and conditional specialization
– Static loads and calls
– Strength reduction, zero and copy propagation, and dead-assignment elimination (precomputed!)

DyC Annotations
– Runtime constants and constant functions
– Whether specialization/division should be monovariant or polyvariant
– Disable/enable internal promotions
– Compile eagerly/lazily downstream of branches
– Code caching style at merges/promotions
– Interprocedural specialization

Manual Annotation
– Profile to find potential gains
– Concentrate on areas with high execution times
– If candidates are not obvious, log parameter values to find runtime constants
– Unroll loops by trial and error

Applications

Optimizations used

Break Even Points

Performance

Speedup without a given feature

Calpa
– A system that automatically generates DyC annotations
– Profiles code, collecting statistics
– Analyzes the results
– Annotates the code
– Basically, automates what was previously done manually

Calpa, Step 1
– Instrumentation tool instruments the original binary
– Executed on representative input
– Generates summarized value and frequency data
– Fed into the next step

The Instrumenter
Three types of information collected:
– Basic block execution frequencies
– Variable definitions
– Variable uses
Points-to information is also collected, so that constants can be invalidated where necessary for safety. Uses are stored as value/occurrence pairs, with the procedure invocation noted, for groups of related values in a procedure.

Profiling Data

Profiling
– Takes seconds to hours
– Naive profiling was sufficient for their purposes, so it was left unoptimized
– Another paper describes more efficient profile gathering

Calpa, Step 2
– An annotation tool searches the space of possible annotations
– Selects annotations and creates an annotated program
– The program is passed to DyC, which compiles it
– Calpa == policy, DyC == mechanism

Candidate Static Variable (CSV) Sets
– A CSV set is the set of CSVs that make an instruction static
– Propagate if exactly one definition exists

CSV Sets Example

    i = 0                     {}
L1: if i >= size goto L2      {i, size}
    uelem = u[i]              {i, u[]}
    velem = v[i]              {i, v[]}
    t = uelem * velem         {i, u[], v[]}
    sum = sum + t             {i, sum, u[], v[]}
    i = i + 1                 {i}
    goto L1                   {}
L2:

Candidate Division (CD) Sets
– A CD is a set of CSVs
– The static instructions in a CD are those whose CSV sets are subsets of the CD
– The CD set is all CDs produced from some combination (union) of the CSV sets
– No need to consider the other CDs (21 of the 32 possible)

CD Sets Example

{}
{i}
{i, size}
{i, u[]}
{i, v[]}
{i, u[], v[]}
{i, sum, u[], v[]}
{i, size, u[]}
{i, size, v[]}
{i, size, u[], v[]}
{i, size, sum, u[], v[]}

Search of CD Space
– The CDs are enumerated, starting with the least-varying variables
– As additional CDs are enumerated, the "best" one so far is kept
– The search terminates when any of the following holds:
– All CDs have been enumerated
– A time quota expires
– The improvement over the "best" so far drops below a threshold

Cost Model
Specialization cost
– Basic block size * # of values
– Loop size * # of values of the induction variable (scaled for multiway loops)
– Total instructions * instruction generation cost
Cache cost
– Lookup cost
– Hash key construction (# of variables * cost per variable)
– Not incurred if the unchecked policy is used
Invalidation cost
– Sum over all invalidation points of execution frequency * invalidation cost

Benefit Model
– Runs a simplified DyC analysis
– Assumes whole-procedure specialization (overestimating costs)
– Counts the number of saved cycles assuming the given CD
– Only looks at the critical path (a simplifying assumption)
– A win if saved cycles > cycle cost

Calpa is safe
– Static, unchecked, and eager annotations are selected when profile information hints at them
– However, these annotations are unsafe in general
– Calpa inserts invalidations at every point where safety could be violated
– It also makes pessimistic assumptions about external routines
– It is always safe to avoid these annotations

Testing
– Tested on previously annotated programs
– The annotation process was much quicker
– Found all the manual annotations
– Plus more annotations

Annotations found
– All the manual ones, plus two more: the search key in a search program, and the vector v in dotproduct
– The unvarying nature of these variables was an artifact of atypical use
– But getting good profiling input is someone else's research

Related Work
Value Profiling
– Builds an efficient system to profile programs, with the aim of using the collected information to drive specialization. They do not collect the value sequence information that Calpa needs. They also do binary instrumentation.
Fabius
– Takes curried functions and generates code for the partially evaluated functions. Thus, the idiom of currying is leveraged to optimize code at runtime.
Tick C
– `C extends C with a few additional constructs that allow explicit, manual runtime code compilation. You specify exactly which code you wish to have compiled in C-like fragments. In the spirit of runtime trade-offs, code generation comes in two forms: one quick to create, and one more efficient.
Tempo
– Tempo can act either as a source-to-source compiler or as a source-to-runtime code generator. It is much more limited in scope than DyC/Calpa. However, it does have an automatic side-effect/alias analysis.