University of Colorado


Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation
CK Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven Wallace, Kim Hazelwood (Intel)
Vijay Janapa Reddi (University of Colorado)
http://rogue.colorado.edu/Pin
PLDI'05

Pin is a new dynamic binary instrumentation system
- Instrumentation: inserting extra code into programs to collect information about their execution
  - Program analysis: code coverage, call-graph generation, memory-leak detection
  - Architectural study: processor simulation, fault injection
- Existing binary-level instrumentation systems:
  - Static: ATOM, EEL, Etch, Morph
  - Dynamic: Dyninst, Vulcan, DTrace, Valgrind, Strata, DynamoRIO

Advantages of Pin Instrumentation
- Easy-to-use instrumentation API
  - Instrumentation code written in C/C++/asm
  - ATOM-like API, based on procedure calls
- Instrumentation tools portable across platforms
  - Same tools work on IA32, EM64T (x86-64), Itanium, ARM
  - Same tools work on Linux and Windows (ongoing work)
- Low instrumentation overhead
  - Pin automatically optimizes instrumentation code
  - Pin can attach instrumentation to a running process
- Robust
  - Handles mixed code and data, variable-length instructions, dynamically generated code
- Transparent
  - The application sees its original addresses, values, and stack content

A Pintool for Tracing Memory Writes

    #include <stdio.h>
    #include "pin.H"

    FILE* trace;

    // Executed immediately before a memory write is executed
    VOID RecordMemWrite(VOID* ip, VOID* addr, UINT32 size) {
        fprintf(trace, "%p: W %p %u\n", ip, addr, size);
    }

    // Executed when an instruction is dynamically compiled
    VOID Instruction(INS ins, VOID* v) {
        if (INS_IsMemoryWrite(ins))
            INS_InsertCall(ins, IPOINT_BEFORE, AFUNPTR(RecordMemWrite),
                           IARG_INST_PTR, IARG_MEMORYWRITE_EA,
                           IARG_MEMORYWRITE_SIZE, IARG_END);
    }

    int main(int argc, char* argv[]) {
        PIN_Init(argc, argv);
        trace = fopen("atrace.out", "w");
        INS_AddInstrumentFunction(Instruction, 0);
        PIN_StartProgram();  // never returns
        return 0;
    }

- The same source code works on all four architectures: Pin takes care of the different addressing modes
- No need to manually save/restore application state: Pin does it for you automatically and efficiently

Dynamic Instrumentation
(figure: original code blocks 1-7 on the left, Pin's code cache on the right)
- Pin fetches the trace starting at block 1 and starts instrumenting it; the translated trace (1', 2', 7') is placed in the code cache, and its exits point back to Pin
- Pin transfers control into the code cache (block 1)
- Pin fetches and instruments a new trace (3', 5', 6'), which trace linking connects to the first trace
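The fetch, compile, and dispatch steps above can be modeled as a lookup-or-compile loop keyed by each trace's original start address. The following is a minimal C++ sketch under that assumption; `CodeCache`, `Jit`, and `enter` are hypothetical names, addresses are plain integers, and the JIT hook stands in for Pin's real fetch/instrument/compile machinery rather than its API.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <utility>

// Toy model of a JIT code cache (not Pin's implementation). Addresses are
// plain integers, and a "compiled trace" is reduced to its number of blocks.
class CodeCache {
public:
    // Hypothetical hook standing in for fetch + instrument + compile.
    using Jit = std::function<int(std::uint64_t addr)>;

    explicit CodeCache(Jit jit) : jit_(std::move(jit)) {
        assert(jit_ && "a JIT hook is required");
    }

    // Called whenever control is in the VM and must continue at `addr`:
    // the first visit compiles the trace, later visits reuse the cached copy.
    int enter(std::uint64_t addr) {
        auto it = cache_.find(addr);
        if (it == cache_.end()) {
            ++compilations_;
            it = cache_.emplace(addr, jit_(addr)).first;
        }
        return it->second;  // stands in for jumping into the translated trace
    }

    int compilations() const { return compilations_; }
    bool contains(std::uint64_t addr) const { return cache_.count(addr) != 0; }

private:
    Jit jit_;
    std::unordered_map<std::uint64_t, int> cache_;
    int compilations_ = 0;
};
```

Entering address 1 twice compiles its trace only once; entering address 3 afterwards compiles a second trace, mirroring the sequence on the slides.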

Pin's Software Architecture
(figure: the application, Pin, and the Pintool share one address space above the operating system and hardware; Pin consists of the instrumentation APIs plus a virtual machine made of the JIT compiler, emulation unit, and code cache)
- Three programs (Pin, Pintool, application) in the same address space: user-level only
- Instrumentation APIs: through which the Pintool communicates with Pin
- JIT compiler: dynamically compiles and instruments code
- Emulation unit: handles instructions that can't be executed directly (e.g., syscalls)
- Code cache: stores compiled code
- All of these are coordinated by the VM

Pin Internal Details
- Loading of Pin, Pintool, and application
- An improved trace linking technique
- Register re-allocation
- Instrumentation optimizations
- Multithreading support

Register Re-allocation
- Instrumented code needs extra registers, e.g.:
  - Virtual registers available to the tool
  - A virtual stack pointer pointing to the instrumentation stack
  - Many more …
- Approaches to getting extra registers:
  - Ad hoc (e.g., DynamoRIO, Strata, DynInst): whenever you need a register, spill one and fill it afterward
  - Re-allocate all registers during compilation:
    - Local allocation (e.g., Valgrind): allocate registers independently within each trace
    - Global allocation (Pin): allocate registers across traces (can be inter-procedural)

Valgrind's Register Re-allocation
Original code:
    mov 1, %eax
    mov 2, %ebx
    cmp %ecx, %edx
    jz t
  t:
    add 1, %eax
    sub 2, %ebx
Re-allocated:
  Trace 1 (virtual -> physical: %eax -> %eax, %ebx -> %esi, %ecx -> %ecx, %edx -> %edx):
    mov 1, %eax
    mov 2, %esi
    cmp %ecx, %edx
    mov %eax, SPILLeax
    mov %esi, SPILLebx
    jz t'
  Trace 2 (virtual -> physical: %eax -> %eax, %ebx -> %edi):
  t':
    mov SPILLeax, %eax
    mov SPILLebx, %edi
    add 1, %eax
    sub 2, %edi
- Simple but inefficient:
  - All modified registers are spilled at a trace's end
  - Registers are refilled at a trace's beginning
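To make the cost of local allocation concrete, here is a small C++ sketch that emits the spill code at a trace exit and the fill code at the next trace's entry, given each trace's own binding. The `Binding` type, the function names, and the textual instruction strings are illustrative assumptions, not Valgrind's actual data structures.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical model: a binding maps each virtual (application) register to
// the physical register holding it inside one trace, e.g. {"ebx", "esi"}.
using Binding = std::map<std::string, std::string>;

// Local (per-trace) allocation: every virtual register the trace modified is
// written back to its spill slot before the trace exits...
std::vector<std::string> spillAtExit(const Binding& b,
                                     const std::vector<std::string>& modified) {
    std::vector<std::string> code;
    for (const std::string& v : modified) {
        assert(b.count(v) && "modified register must be bound");
        code.push_back("mov %" + b.at(v) + ", SPILL" + v);
    }
    return code;
}

// ...and every virtual register the next trace uses is reloaded on entry,
// into whichever physical register that trace chose independently.
std::vector<std::string> fillAtEntry(const Binding& b,
                                     const std::vector<std::string>& used) {
    std::vector<std::string> code;
    for (const std::string& v : used) {
        assert(b.count(v) && "used register must be bound");
        code.push_back("mov SPILL" + v + ", %" + b.at(v));
    }
    return code;
}
```

With Trace 1's binding (%ebx in %esi), spilling %eax and %ebx yields `mov %eax, SPILLeax` and `mov %esi, SPILLebx`; with Trace 2's binding (%ebx in %edi), the fills are `mov SPILLeax, %eax` and `mov SPILLebx, %edi`, as on the slide.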

Pin's Register Re-allocation
Scenario (1): compiling a new trace at a trace exit
Original code:
    mov 1, %eax
    mov 2, %ebx
    cmp %ecx, %edx
    jz t
  t:
    add 1, %eax
    sub 2, %ebx
Re-allocated (binding at Trace 1's exit, virtual -> physical: %eax -> %eax, %ebx -> %esi, %ecx -> %ecx, %edx -> %edx):
  Trace 1:
    mov 1, %eax
    mov 2, %esi
    cmp %ecx, %edx
    jz t'
  Trace 2, compiled using the binding at Trace 1's exit:
  t':
    add 1, %eax
    sub 2, %esi
- No spilling/filling needed across traces

Pin's Register Re-allocation
Scenario (2): targeting an already generated trace at a trace exit
Original code:
    mov 1, %eax
    mov 2, %ebx
    cmp %ecx, %edx
    jz t
  t:
    add 1, %eax
    sub 2, %ebx
Re-allocated:
  Trace 1 (being compiled; %ebx is in %esi at the exit):
    mov 1, %eax
    mov 2, %esi
    cmp %ecx, %edx
    mov %esi, SPILLebx
    mov SPILLebx, %edi
    jz t'
  Trace 2 (already in the code cache; expects %ebx in %edi):
  t':
    add 1, %eax
    sub 2, %edi
- Minimal spilling/filling code
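Scenario (2) boils down to reconciling two bindings: the one in effect at the exit of the trace being compiled, and the one the already-compiled target trace expects. A hedged C++ sketch of that reconciliation follows, assuming a hypothetical `Binding` map from virtual to physical register and spill slots named `SPILL<reg>` as on the slides; the source does not show Pin's real compensation-code generator.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical model: a binding maps each virtual (application) register to
// the physical register holding it, e.g. {"ebx", "esi"}.
using Binding = std::map<std::string, std::string>;

// Glue emitted on a trace exit whose binding is `from` when it jumps to a
// trace already compiled with binding `to`. Only registers whose physical
// home differs are moved, through their spill slot. All stores come before
// all loads so that a permutation (e.g. a swap) never overwrites a value
// before it has been saved.
std::vector<std::string> reconcile(const Binding& from, const Binding& to) {
    std::vector<std::string> stores, loads;
    for (const auto& kv : to) {
        const std::string& virt = kv.first;
        const std::string& dstPhys = kv.second;
        auto it = from.find(virt);
        // Precondition: every register the target expects is bound at the exit.
        assert(it != from.end() && "register unbound at the trace exit");
        if (it->second == dstPhys) continue;  // same home: no code needed
        stores.push_back("mov %" + it->second + ", SPILL" + virt);
        loads.push_back("mov SPILL" + virt + ", %" + dstPhys);
    }
    stores.insert(stores.end(), loads.begin(), loads.end());
    return stores;
}
```

For the slide's example (the exit binds %ebx to %esi, the target expects %edi) it produces exactly `mov %esi, SPILLebx` followed by `mov SPILLebx, %edi`; for identical bindings, as in scenario (1), it produces nothing.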

Instrumentation Optimizations
- Inline instrumentation code into the application
- Avoid saving/restoring eflags with liveness analysis
- Schedule inlined instrumentation code

Example: Instruction Counting
Original code:
    cmov %esi, %edi
    cmp %edi, (%esp)
    jle <target1>
    add %ecx, %edx
    cmp %edx, 0
    je <target2>
Instrumentation call:
    BBL_InsertCall(bbl, IPOINT_BEFORE, (AFUNPTR)docount,
                   IARG_UINT32, BBL_NumIns(bbl), IARG_END);
Instrumented without applying any optimization:
  Trace:
    mov %esp, SPILLappsp
    mov SPILLpinsp, %esp
    call <bridge>
    cmov %esi, %edi
    mov SPILLappsp, %esp
    cmp %edi, (%esp)
    jle <target1'>
    mov %esp, SPILLappsp
    mov SPILLpinsp, %esp
    call <bridge>
    add %ecx, %edx
    cmp %edx, 0
    je <target2'>
  bridge():
    pushf
    push %edx
    push %ecx
    push %eax
    movl 0x3, %eax
    call docount
    pop %eax
    pop %ecx
    pop %edx
    popf
    ret
  docount():
    add %eax, icount
    ret
- 33 extra instructions executed altogether

Example: Instruction Counting
Original code:
    cmov %esi, %edi
    cmp %edi, (%esp)
    jle <target1>
    add %ecx, %edx
    cmp %edx, 0
    je <target2>
With inlining:
  Trace:
    mov %esp, SPILLappsp
    mov SPILLpinsp, %esp
    pushf
    add 0x3, icount
    popf
    cmov %esi, %edi
    mov SPILLappsp, %esp
    cmp %edi, (%esp)
    jle <target1'>
    mov %esp, SPILLappsp
    mov SPILLpinsp, %esp
    pushf
    add 0x3, icount
    popf
    add %ecx, %edx
    cmp %edx, 0
    je <target2'>
- 11 extra instructions executed

Example: Instruction Counting
Original code:
    cmov %esi, %edi
    cmp %edi, (%esp)
    jle <target1>
    add %ecx, %edx
    cmp %edx, 0
    je <target2>
With inlining + eflags liveness analysis:
  Trace:
    mov %esp, SPILLappsp
    mov SPILLpinsp, %esp
    pushf
    add 0x3, icount
    popf
    cmov %esi, %edi
    mov SPILLappsp, %esp
    cmp %edi, (%esp)
    jle <target1'>
    add 0x3, icount
    add %ecx, %edx
    cmp %edx, 0
    je <target2'>
- 7 extra instructions executed
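The liveness question behind this step is simple: scanning forward from the insertion point, does some instruction read eflags before another overwrites it? A minimal C++ sketch follows, assuming a hypothetical `Insn` record that tracks only whether each instruction reads or overwrites the flags. This is a simplification: real x86 analysis has to track individual flags, since for example `inc` leaves CF untouched.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical instruction model: only eflags effects are tracked, and
// `writesFlags` is assumed to overwrite every status flag (true for add and
// cmp, but not for e.g. inc, which leaves CF alone).
struct Insn {
    std::string text;
    bool readsFlags;
    bool writesFlags;
};

// True if eflags may still be needed at an insertion point placed just before
// bb[pos]: scan forward and stop at the first reader or writer. Falling off
// the end of the block is treated conservatively as live, because a
// successor block might read the flags.
bool flagsLiveBefore(const std::vector<Insn>& bb, std::size_t pos) {
    assert(pos <= bb.size());
    for (std::size_t i = pos; i < bb.size(); ++i) {
        if (bb[i].readsFlags) return true;   // read first: must preserve
        if (bb[i].writesFlags) return false; // overwritten first: dead
    }
    return true;
}
```

Applied to the two basic blocks above, it reports eflags live before `cmov` (which reads the flags, so `pushf`/`popf` stay) and dead before `add %ecx, %edx` (which overwrites them, so the bare `add 0x3, icount` suffices).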

Example: Instruction Counting
Original code:
    cmov %esi, %edi
    cmp %edi, (%esp)
    jle <target1>
    add %ecx, %edx
    cmp %edx, 0
    je <target2>
With inlining + eflags liveness analysis + scheduling:
  Trace:
    cmov %esi, %edi
    add 0x3, icount
    cmp %edi, (%esp)
    jle <target1'>
    add 0x3, icount
    add %ecx, %edx
    cmp %edx, 0
    je <target2'>
- 2 extra instructions executed
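Scheduling exploits the fact that a block counter can be bumped anywhere inside the block and still runs exactly once per execution of it. The sketch below, again assuming a hypothetical per-instruction `FlagUse` record that ignores partial flag writes, picks the earliest slot where eflags is dead, so no `pushf`/`popf` is needed; on the slides this also removes the stack switch that surrounded them.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model: per instruction, whether it reads / overwrites eflags.
struct FlagUse {
    bool reads;
    bool writes;
};

// Returns the earliest slot in a basic block where a flag-clobbering counter
// update (e.g. "add 0x3, icount") can go without saving eflags, or -1 if none
// exists. Slot `pos` means "just before bb[pos]", so the update always
// precedes the block's final branch and runs once per execution of the block.
int earliestFlagsDeadSlot(const std::vector<FlagUse>& bb) {
    assert(!bb.empty() && "a basic block has at least one instruction");
    for (std::size_t pos = 0; pos < bb.size(); ++pos) {
        // eflags is dead at `pos` if the next instruction that touches the
        // flags overwrites them before anything reads them.
        for (std::size_t i = pos; i < bb.size(); ++i) {
            if (bb[i].reads) break;  // live at this slot: try the next one
            if (bb[i].writes) return static_cast<int>(pos);
        }
    }
    return -1;  // no dead slot: keep pushf/popf around the update
}
```

For the first block (cmov, cmp, jle) it returns slot 1, i.e. after `cmov` and before `cmp`; for the second block (add, cmp, je) it returns slot 0, matching the scheduled code above.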

Pin Instrumentation Performance
(figure: runtime overhead of basic-block counting with Pin on IA32; SPEC2K using reference data sets)

Comparison among Dynamic Instrumentation Tools
(figure: runtime overhead of basic-block counting with three different tools)
- Valgrind is a popular instrumentation tool on Linux
  - Call-based instrumentation, no inlining
- DynamoRIO is the performance leader in binary dynamic optimization
  - Manual inlining, no eflags liveness analysis or scheduling
- Pin automatically provides efficient instrumentation

Pin Applications
- Sample tools in the Pin distribution:
  - Cache simulators, branch predictors, address tracer, syscall tracer, edge profiler, stride profiler
- Some tools developed and used inside Intel:
  - Opcodemix (analyzes code generated by compilers)
  - PinPoints (finds representative regions in programs to simulate)
  - A tool for detecting memory bugs
- Some companies are writing their own Pintools:
  - A major database vendor, a major search engine provider
- Some universities using Pin in teaching and research:
  - U. of Colorado, MIT, Harvard, Princeton, U. of Minnesota, Northeastern, Tufts, University of Rochester, …

Conclusions
- Pin: a dynamic instrumentation system for building your own program analysis tools
  - Easy to use, robust, transparent, efficient
  - Tool source compatible on IA32, EM64T, Itanium, ARM
  - Works on large applications: database, search engine, web browsers, …
  - Available on Linux; Windows version coming soon
- Downloadable from http://rogue.colorado.edu/Pin
  - User manual, many example tools, tutorials
  - 3300 downloads since July 2004

Acknowledgments
- Prof. Dan Connors: hosting the Pin website at U. of Colorado
- Intel Bistro Team: providing the Falcon decoder/encoder; suggesting instrumentation scheduling
- Mark Charney: providing the XED decoder/encoder
- Ramesh Peri: implementing part of the Itanium instrumentation

Backup

Talk Outline
- A sample Pintool
- Pin internal details
- Experimental results
- Pin applications
- Conclusions

Trace Linking
- Trace linking is a very effective optimization
  - Bypasses the VM when transferring from one trace to another
  - Slowdown without trace linking: as much as 100x
- Linking direct branches/calls
  - Straightforward, as targets are unique
- Linking indirect branches/calls and returns
  - More challenging because the target can be different each time
- Our approach:
  - For all indirect control transfers, use chaining
  - For returns, further optimize with function cloning
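The benefit of linking direct exits can be illustrated by counting trips through the VM. Below is a toy C++ sketch, assuming each direct exit is identified by a hypothetical (trace id, exit index) pair and always targets the same trace; it models only the bookkeeping, not the patching of the exit branch itself.

```cpp
#include <cassert>
#include <set>
#include <utility>
#include <vector>

// Hypothetical model: a direct trace exit is identified by
// (trace id, exit index); its target never changes.
using Exit = std::pair<int, int>;

// Counts how many trace-to-trace transfers have to go through the VM.
// With linking, the VM patches a direct exit to jump straight to its target
// trace the first time the exit is taken, so only that first transfer pays.
int vmEntries(const std::vector<Exit>& transfers, bool linking) {
    std::set<Exit> linked;
    int entries = 0;
    for (const Exit& e : transfers) {
        if (linking && linked.count(e)) continue;  // patched: bypass the VM
        ++entries;                                  // exit stub returns to VM
        if (linking) linked.insert(e);
    }
    return entries;
}
```

For a loop that alternates between two traces three times (six transfers over two distinct exits), the model counts six VM entries without linking and two with linking.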

Indirect Trace Linking
Original indirect jump:
    jmp [%eax]
Translated, with a chain of predicted targets:
    mov [%eax], T
    jmp target_1'
  target_1':
    if (T != target_1) jmp target_2'
    …
  target_N':
    if (T != target_N) jmp LookupHtab
    …
  LookupHtab (slow path):
    if (hit) jmp translated[T]
    else call Pin
- Chains are built incrementally
  - The most recent target is inserted at the chain's head
- The hash table is local to each indirect jump
- Improved prediction accuracy over existing schemes
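The dispatch logic above (compare against a chain of predicted targets, then fall back to a hash table local to the jump, then to Pin) can be sketched in C++ as follows. `IndirectSite`, the `translate` hook standing in for a call into Pin, and the `maxChain` bound on chain length are assumptions for illustration; the slide does not say how long Pin lets a chain grow.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <functional>
#include <unordered_map>
#include <utility>

// Hypothetical model of one indirect jump site. Original and translated
// addresses are plain integers; `translate` stands in for calling Pin.
class IndirectSite {
public:
    using Translate = std::function<std::uint64_t(std::uint64_t)>;

    IndirectSite(Translate translate, std::size_t maxChain)
        : translate_(std::move(translate)), maxChain_(maxChain) {
        assert(translate_ && "a translation hook is required");
    }

    // Resolves original target T to its translated address.
    std::uint64_t dispatch(std::uint64_t target) {
        for (const auto& entry : chain_) {  // if (T != target_i) try next
            ++compares_;
            if (entry.first == target) return entry.second;
        }
        auto it = htab_.find(target);       // LookupHtab, local to this site
        if (it == htab_.end()) {            // miss: call Pin to translate
            ++vmCalls_;
            it = htab_.emplace(target, translate_(target)).first;
        }
        if (chain_.size() < maxChain_)      // grow the chain incrementally,
            chain_.emplace_front(target, it->second);  // newest at the head
        return it->second;
    }

    int vmCalls() const { return vmCalls_; }
    int compares() const { return compares_; }

private:
    Translate translate_;
    std::size_t maxChain_;
    std::deque<std::pair<std::uint64_t, std::uint64_t>> chain_;  // (orig, translated)
    std::unordered_map<std::uint64_t, std::uint64_t> htab_;
    int vmCalls_ = 0;
    int compares_ = 0;
};
```

Once a target is in the chain, repeated jumps to it never reach Pin; once the chain is full, new targets are still served from the site's own hash table after their first translation.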

Return-Address Prediction
Distinguish different callers to a function by cloning.
Original code:
  A():
    call F()
  B():
    call F()
  F():
    ret
No cloning (one translated copy of F shared by all callers):
  F'():
    pop T
    jmp A'
  A':
    if (T != A) jmp B'
    …
  B':
    if (T != B) jmp LookupHtab1
    …
Cloning (one translated copy of F per caller):
  F_A'():
    pop T
    jmp A'
  A':
    if (T != A) jmp LookupHtab1
    …
  F_B'():
    pop T
    jmp B'
  B':
    if (T != B) jmp LookupHtab2
    …
- Prediction accuracy further improved
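A rough way to see why cloning helps is to count the `if (T != X)` compares executed on return. The C++ sketch below is a hypothetical cost model, not Pin code: without cloning, all callers share one fixed-order chain (A' then B'), while each clone's first prediction is always its own caller.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical cost model: counts the "if (T != X)" compares executed when F
// returns, for a sequence of callers (return addresses) such as A, B, A, B.

// Without cloning, every return goes through one shared chain, checked in the
// fixed order given by `chain` (e.g. A' then B').
int comparesShared(const std::vector<char>& callers,
                   const std::vector<char>& chain) {
    int total = 0;
    for (char ret : callers) {
        std::size_t i = 0;
        while (i < chain.size() && chain[i] != ret) ++i;
        assert(i < chain.size() && "caller missing from the chain");
        total += static_cast<int>(i) + 1;  // compares up to and including the hit
    }
    return total;
}

// With cloning, F_A'() is reached only from A and F_B'() only from B, so each
// clone's first compare names its own caller and always matches.
int comparesCloned(const std::vector<char>& callers) {
    return static_cast<int>(callers.size());
}
```

For the call sequence A, B, A, B the shared chain executes 1 + 2 + 1 + 2 = 6 compares, while the cloned versions execute 4.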

Pin Multithreading Support
- For instrumenting multithreaded programs:
  - Pin intercepts all threading-related system calls: when it sees a clone(), it creates the new thread and starts jitting it
  - Pin provides a "thread id" that Pintools can use to index thread-local storage
  - Pin's virtual registers are backed by a per-thread spilling area
- For writing multithreaded Pintools:
  - Pin cannot link libpthread into the Pintool (to avoid conflicts from two libpthreads both setting up signal handlers), so:
    - Pin implements a subset of libpthread itself
    - Pin can also redirect libpthread calls in the Pintool to the application's libpthread

Instrumenting Multithreaded Programs
- Pin instruments multithreaded programs:
  - The spilling area has to be thread-local
  - Pin creates a new per-thread spilling area when it intercepts a thread-create system call (e.g., clone())
- How to access the per-thread spilling area?
  - Steal a physical register to point to the per-thread spilling area
- x86-specific optimization:
  - Initially assume a single-threaded program
    - Access the spilling area via its absolute address
  - If multiple threads are detected later:
    - Flush the code cache
    - Recompile with a physical register pointing to the per-thread spilling area
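The single-threaded fast path and its fallback can be sketched as a small state machine. This C++ model is an illustration built on assumptions: `SpillAreas`, the slot count, and the flush counter are hypothetical, and real Pin steals an actual physical register rather than indexing a vector.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Arbitrary, hypothetical number of spill slots per thread.
constexpr std::size_t kSlotsPerThread = 16;

// Hypothetical model of the x86 optimization above: start by addressing one
// spill area at a fixed absolute address; once a second thread appears, flush
// the code cache and switch to a stolen register that points at the current
// thread's area.
class SpillAreas {
public:
    enum class Mode { AbsoluteAddress, StolenRegister };

    // Called for the initial thread and whenever a thread-create system call
    // (e.g. clone()) is intercepted. Returns the new thread's id.
    std::size_t onThreadCreate() {
        areas_.emplace_back(kSlotsPerThread, 0L);
        if (areas_.size() > 1 && mode_ == Mode::AbsoluteAddress) {
            ++cacheFlushes_;               // code assumed a single thread
            mode_ = Mode::StolenRegister;  // recompile using the register
        }
        return areas_.size() - 1;
    }

    // The spill slots that a given thread's compiled code reads and writes.
    std::vector<long>& area(std::size_t tid) {
        assert(tid < areas_.size() && "unknown thread id");
        return areas_[tid];
    }

    Mode mode() const { return mode_; }
    int cacheFlushes() const { return cacheFlushes_; }

private:
    std::vector<std::vector<long>> areas_;
    Mode mode_ = Mode::AbsoluteAddress;
    int cacheFlushes_ = 0;
};
```

Creating the initial thread leaves the absolute-address mode in place; creating a second thread triggers one code-cache flush and the switch to the stolen-register mode, and later threads cause no further flushes.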

Optimizing Instrumentation Performance
- Observations:
  - Slowdown is largely due to executing instrumentation code rather than to dynamic compilation
  - So it makes sense to spend more time optimizing that code
- Focus on optimizing simple instrumentation tools:
  - Performance depends on how fast we can transition between the application and the tool
  - Simple yet commonly used (e.g., basic-block profiling)

Pin Source Code Organization
Pin source is organized into generic, architecture-dependent, and OS-dependent modules:

  Architecture            # source files    # source lines
  Generic                   87 (48%)         53595 (47%)
  x86 (32-bit + 64-bit)     34 (19%)         22794 (20%)
  Itanium                   34 (19%)         20474 (18%)
  ARM                       27 (14%)         17933 (15%)
  TOTAL                    182 (100%)       114796 (100%)

- ~50% of the code is shared among architectures

Pin Instrumentation Performance
Performance of basic-block counting with Pin/IA32 (average slowdown):

  Optimization                                INT      FP
  Without optimization                        10.4x    3.9x
  Inlining                                     7.8x    3.5x
  Inlining + eflags analysis                   2.8x    1.5x
  Inlining + eflags analysis + scheduling      2.5x    1.4x

Comparison among Dynamic Instrumentation Tools
(figure: performance of basic-block counting with three different tools)
- Valgrind is a popular instrumentation tool on Linux
  - Call-based instrumentation, no inlining
- DynamoRIO is the performance leader in dynamic optimization
  - Manual inlining, no eflags liveness analysis or scheduling
- Pin automatically provides efficient instrumentation

Pin/IA32 Performance (no instrumentation)

Pin/EM64T Performance (no instrumentation)

Pin0/IPF Performance (no instrumentation)