ROGUE Dynamic Optimization Framework Using Pin. Vijay Janapa Reddi, PhD Candidate, Electrical and Computer Engineering, University of Colorado at Boulder.

Presentation transcript:

1 ROGUE Dynamic Optimization Framework Using Pin
Vijay Janapa Reddi, PhD Candidate, Electrical and Computer Engineering, University of Colorado at Boulder
Intel Mentors: Robert S. Cohn & C.K. Luk
Internship at Intel MMDC

2 Motivation
Most optimizers are "black-box" style
- Limited ability for customization
Provide a more open API for optimization
- Profiling, trace building, optimization, cache management
- Include all of the Pin API for instrumentation
- Flexible, but hides the low-level details of the JIT

3 Potential Users
Research
Education
- University of Colorado at Boulder
  Advanced Computer Architecture
  Code Generation and Optimization
- Pin: A Binary Instrumentation Tool for Computer Architecture Research and Education (WCAE 2004)

4 Pin Model
[Figure: original code as basic blocks A to F; Pin dispatcher; code cache holding instrumented code A', B', E', D']

5 ROGUE Model
[Figure: original code as basic blocks A to F; code cache holding the hot path A', C', F', D']

6 How is ROGUE different from Pin?
Pin
- Instrumentation only
- Fixed method for building traces
- Application executes only out of the code cache
ROGUE
- Optimization and profiling (instrumentation or hardware)
- User-defined trace building
- Application executes a mix:
  Hot traces (code cache)
  Instrumented traces (code cache)
  Original program (program memory)

7 Dynamic Optimization Flow
Perform runtime analysis
- Hardware performance monitoring unit
  Branch Target Buffer
- Software profilers
  BBLs
  Edges
  Paths
Generate optimized code sequences
Patch original code to execute optimized code
Repeat the flow

8 ROGUE Model
[Figure: original code as basic blocks A to F; code cache holding the hot path A', C', F', D']

9 Code Layout
Profile information
- Edge profiler
- Path profiler
Code fetching mechanism
- Fetch a range of instructions, basic blocks, etc.
Perform optimizations

10 Collecting Profile Information
[Figure: flow graph of basic blocks A to F]
Step 0: Instrument all edges (Pin instrumentation)
    INS_InsertCall(ins, IPOINT_TAKEN_BRANCH, (AFUNPTR) TakenBr,
                   IARG_PTR, taken_edg, … IARG_END);
    INS_InsertCall(ins, IPOINT_AFTER, (AFUNPTR) Fallthrough,
                   IARG_PTR, fallthrough_edg, … IARG_END);
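The analysis routines attached in Step 0 (TakenBr and Fallthrough) need somewhere to accumulate per-edge counts. The slides do not show that data structure, so the following is only a minimal sketch of one possibility. The names `Addr`, `Edge`, and `EdgeProfiler` are hypothetical and are not part of Pin or ROGUE.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>

// Hypothetical stand-in for an edge profiler; not a Pin or ROGUE type.
using Addr = std::uint64_t;
using Edge = std::pair<Addr, Addr>;  // (source block address, destination block address)

class EdgeProfiler {
public:
    // Would be called from an analysis routine each time the edge executes.
    void Record(Addr src, Addr dst) { ++counts_[{src, dst}]; }

    // Execution count of one edge; 0 if the edge was never observed.
    std::uint64_t Count(Addr src, Addr dst) const {
        auto it = counts_.find({src, dst});
        return it == counts_.end() ? 0 : it->second;
    }

private:
    std::map<Edge, std::uint64_t> counts_;
};
```

A real tool would also have to consider thread safety and counter overhead, which this sketch ignores.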

11 Code Fetching
[Figure: flow graph of basic blocks A to F; block A at address 0x80abcdef]
Step 1: Fetch the hot target basic block
    BBL bbl = BBL_Fetch(0x80abcdef)
Hot edge: use any threshold metric
- E.g., execution count threshold = 100
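One way to read the "execution count threshold = 100" example is as a counter that triggers code fetching the first time a target becomes hot. The sketch below assumes that reading. `HotTargetMonitor` and its callback are hypothetical names for illustration, not ROGUE's monitor interface.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <utility>

using Addr = std::uint64_t;

// Hypothetical monitor: invokes a callback once, the first time an edge
// target reaches the execution-count threshold.
class HotTargetMonitor {
public:
    HotTargetMonitor(std::uint64_t threshold, std::function<void(Addr)> on_hot)
        : threshold_(threshold), on_hot_(std::move(on_hot)) {}

    // Would be called whenever a profiled edge to `target` executes.
    void OnEdgeTaken(Addr target) {
        std::uint64_t& n = counts_[target];
        if (++n == threshold_) on_hot_(target);  // fires exactly once per target
    }

private:
    std::uint64_t threshold_;
    std::function<void(Addr)> on_hot_;
    std::unordered_map<Addr, std::uint64_t> counts_;
};
```

In a tool like the one on these slides, the callback would be the place to call the trace generator with the hot address.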

12 Trace Generation
[Figure: flow graph of basic blocks A to F; trace so far: A']
Step 2: Create a trace to hold the fetched bbl
    TRACE trace = TRACE_Alloc(bbl)

13 Trace Generation
[Figure: flow graph of basic blocks A to F]
Step 3: Walk the flow graph
    for (EDG edg = BBL_EdgHead(bbl); EDG_Valid(edg); edg = EDG_Next(edg)) {
        …
        if (maxedg_cnt < cnt) {
            maxedg_cnt = cnt;
            maxedg = edg;
        }
        …
    }
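The loop in Step 3 is a maximum search over a block's outgoing edges. The same idea can be sketched without the ROGUE types, which are not publicly documented. `OutEdge`, `HotChoice`, and `PickHotEdge` below are hypothetical names. The sketch also returns the hot edge's share of the total count, which a probability-based termination test would need.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical edge record: destination block id plus profiled execution count.
struct OutEdge {
    int dst;
    std::uint64_t count;
};

struct HotChoice {
    int index;           // position of the hottest edge, or -1 if none
    double probability;  // hottest count divided by the sum of all counts
};

// Scans the outgoing edges once, tracking the maximum count and the total.
inline HotChoice PickHotEdge(const std::vector<OutEdge>& edges) {
    std::uint64_t sum = 0;
    std::uint64_t best = 0;
    int best_idx = -1;
    for (std::size_t i = 0; i < edges.size(); ++i) {
        sum += edges[i].count;
        if (edges[i].count > best) {
            best = edges[i].count;
            best_idx = static_cast<int>(i);
        }
    }
    if (best_idx < 0) return {-1, 0.0};  // no edges, or none ever executed
    return {best_idx, static_cast<double>(best) / static_cast<double>(sum)};
}
```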

14 Trace Generation
[Figure: flow graph of basic blocks A to F; trace so far: A' C']
Step 4: Add the new hot edge target to the trace
    bbl = TRACE_AddEdg(trace, bbl, maxedg);
    TRACE_AddInlineCallEdg
    TRACE_AddInlineReturnEdg
    TRACE_AddBranchEdg

15 Trace Generation
[Figure: flow graph of basic blocks A to F]
Step 5: Repeat Step 3 … Step 4 until trace termination
- Probability
- Loopback identification
- Max. number of instructions per trace
- …
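The three termination criteria listed in Step 5 can be combined into a single check that runs before each block is appended. The slides do not say how ROGUE orders or implements these tests. The sketch below is an assumption, and every name in it (`Stop`, `TraceLimits`, `TraceState`, `CheckTermination`) is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// All names below are hypothetical; they are not part of Pin or ROGUE.
enum class Stop { kContinue, kLowProbability, kLoopback, kTooLong };

struct TraceLimits {
    double min_probability;        // stop once the path probability drops below this
    std::size_t max_instructions;  // upper bound on trace length
};

struct TraceState {
    double probability = 1.0;      // product of hot-edge probabilities so far
    std::size_t instructions = 0;  // instructions already in the trace
    std::set<int> blocks;          // ids of blocks already in the trace
};

// Decides whether the candidate block may be appended to the trace.
inline Stop CheckTermination(const TraceState& s, const TraceLimits& lim,
                             int candidate, std::size_t candidate_size,
                             double edge_probability) {
    if (s.probability * edge_probability < lim.min_probability)
        return Stop::kLowProbability;
    if (s.blocks.count(candidate) != 0)
        return Stop::kLoopback;  // the hot edge leads back into the trace
    if (s.instructions + candidate_size > lim.max_instructions)
        return Stop::kTooLong;
    return Stop::kContinue;
}
```

Returning the reason, instead of a plain boolean, lets a trace builder treat a loopback differently from a cold exit, for example when unrolling loops.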

16 Trace Generation
Step 6: Finalize trace generation
    TRACE_GenerateCode(trace)
1. Straighten control flow
   - Branch inversion, redundant branch elimination, call/return inlining, and exit path fix-ups
2. Compile the trace
3. Enter the trace into the code cache
4. Patch references to this trace
   - Any other edges that refer to the same target address can all be patched to refer to the new optimized trace
[Figure: code cache holding A' C' F' D'; original code blocks A to F]
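Branch inversion, the first item under "Straighten control flow", can be illustrated on a toy model: when the next block in the trace is a branch's taken target, the condition is flipped so the hot path falls through and the old fall-through becomes the off-trace exit. This is a sketch of the general technique only, not of what TRACE_GenerateCode does internally. `Block`, `LaidOutBlock`, and `Straighten` are hypothetical names, and every block is assumed to end in a conditional branch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical block model; ids stand in for addresses. Not Pin or ROGUE types.
struct Block {
    int id;
    int taken;        // target when the conditional branch is taken
    int fallthrough;  // target when it is not taken
};

struct LaidOutBlock {
    int id;
    bool inverted;    // true if the branch condition was flipped
    int exit_target;  // off-trace target the (possibly inverted) branch jumps to
};

// Lays out the trace so that each block falls through to its trace successor.
inline std::vector<LaidOutBlock> Straighten(const std::vector<Block>& trace) {
    std::vector<LaidOutBlock> out;
    for (std::size_t i = 0; i < trace.size(); ++i) {
        const Block& b = trace[i];
        bool has_next = i + 1 < trace.size();
        // Invert only when the hot successor is currently the taken target.
        bool invert = has_next && b.taken == trace[i + 1].id;
        out.push_back({b.id, invert, invert ? b.fallthrough : b.taken});
    }
    return out;
}
```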

17 Example tool summary
Runtime Optimization Guided Using Edges
Trace generation
- Loop unrolling
- Inline call and return paths
Optimizations in the future
- Eliminate redundant branches after code layout
- Constant propagation
- Dead code elimination
- Common sub-expression elimination
- …

18 A simple trace generator using ROGUE
- Fetch a basic block starting from an address
- Invoke when some threshold metric is reached
- Initialize a new trace with the fetched basic block
- Walk the flow graph to find the hot edge
- Add hot path instructions to the trace
- Use probability as a trace termination metric
- Fix up control flow, compile, patch, and enter the trace into the cache

    VOID TraceGenerator(ADDRINT address)
    {
        BBL bbl = BBL_Fetch(address);
        TRACE trace = TRACE_Alloc(bbl);
        double prob = 1.0;  // probability of staying on the trace so far

        while (prob > 0.4) {
            EDG maxedg;
            UINT32 maxedg_cnt = 0, sumedg_cnt = 0;  // reset for each block

            for (EDG edg = BBL_EdgHead(bbl); EDG_Valid(edg); edg = EDG_Next(edg)) {
                UINT32 edg_cnt = EdgProfilerCount(EDG_BblSrc(edg), EDG_BblDst(edg));
                if (maxedg_cnt < edg_cnt) {
                    maxedg = edg;
                    maxedg_cnt = edg_cnt;
                }
                sumedg_cnt += edg_cnt;
            }
            if (sumedg_cnt == 0) break;  // no profiled outgoing edge to follow

            bbl = TRACE_AddEdg(trace, bbl, maxedg);
            prob *= (double) maxedg_cnt / sumedg_cnt;  // floating-point, not integer, division
        }
        TRACE_GenerateCode(trace);
    }
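Because the ROGUE API is not publicly available, the trace generator above cannot be run as shown. The following self-contained sketch mimics its control flow on a toy flow graph with integer block ids, so the behavior can be tried out. `Edge`, `Cfg`, and `BuildTrace` are hypothetical names. The loopback check is an added assumption, taken from the termination criteria on slide 15, and keeps a loop whose back edge has probability 1 from growing the trace forever.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Toy flow graph: block id -> outgoing edges with profiled counts.
// Hypothetical types for illustration; not Pin or ROGUE data structures.
struct Edge {
    int dst;
    std::uint64_t count;
};
using Cfg = std::map<int, std::vector<Edge>>;

// Follows the hottest outgoing edge from `start` while the accumulated path
// probability stays above `min_prob`, as the slide's TraceGenerator does.
inline std::vector<int> BuildTrace(const Cfg& cfg, int start, double min_prob) {
    std::vector<int> trace{start};
    double prob = 1.0;
    int cur = start;
    while (prob > min_prob) {
        auto it = cfg.find(cur);
        if (it == cfg.end()) break;  // block has no outgoing edges
        const Edge* hot = nullptr;
        std::uint64_t sum = 0;
        for (const Edge& e : it->second) {
            sum += e.count;
            if (hot == nullptr || e.count > hot->count) hot = &e;
        }
        if (hot == nullptr || sum == 0) break;  // nothing profiled
        // Assumed loopback rule: stop when the hot edge re-enters the trace.
        if (std::find(trace.begin(), trace.end(), hot->dst) != trace.end()) break;
        trace.push_back(hot->dst);
        cur = hot->dst;
        prob *= static_cast<double>(hot->count) / static_cast<double>(sum);
    }
    return trace;
}
```

As in the slide's version, the probability is updated after a block is appended, so the block that pushes the product below the threshold is still part of the trace.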

19 ROGUE Optimization Comparison
[Chart: GCC Opt. Level 3]

20 ROGUE Optimization Comparison
[Chart: Intel Compiler]

21 The ROGUE Vision
[Figure: block diagram with the labels Application, Code Cache, Optimized Traces, Re-Optimizations, "Observe execution behavior", and ROGUE components Trace Generator, Optimizer, Cache Manager, Monitor, HW Perf. Unit, Phase Detector, Instrumentation]

22 The ROGUE Vision (2)
Dynamic Optimizer Interface
- Trace generator
  Control trace generation (path, size, thresholds, …)
- Monitor
  Register callbacks to trigger trace generation
- Optimizer
  Provided with some standard optimizations
  Ability to write custom optimizations (add/delete instructions)
- Cache manager
  Placement strategies for generated traces in the code cache
  Patching of original code to use optimized code in the code cache
Dynamic Optimization Engine
- Build a dynamic optimizer using the interface

23 ROGUE Current Status
[Figure: block diagram with the labels Application, Code Cache, Optimized Traces, Re-Optimizations, "Observe execution behavior", and ROGUE components Trace Generator, Optimizer, Cache Manager (labeled "Functional Modules"), Monitor, HW Perf. Unit, Phase Detector, Instrumentation]

24 ROGUE Summary
Dynamic optimization framework
- Facilitates the construction of customizable dynamic optimizers via high-level abstraction
- Tool for research and teaching
API (Application Programmer Interface)
- New API to perform dynamic optimizations
- Inherits the complete Pin 2.0 API