Performance Optimization of Pintools C K Luk

Copyright © 2006 Intel Corporation. All Rights Reserved.

Reducing Instrumentation Overhead
Total Overhead = Pin’s Overhead + Pintool’s Overhead
- Pin’s Overhead: it is the job of the Pin developers to minimize this (~5% for SPECfp and ~20% for SPECint).
- Pintool’s Overhead: Pintool writers can help minimize this!

Reducing Pintool’s Overhead
Pintool’s Overhead = Instrumentation Routines Overhead + Analysis Routines Overhead
Analysis Routines Overhead = Frequency of calling an Analysis Routine x Work required in the Analysis Routine
Work required in the Analysis Routine = Work required for transiting to Analysis Routine + Work done inside Analysis Routine
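The decomposition on this slide can be written down as a tiny cost model. The sketch below is illustrative only: the struct, the function names, and the "work units" are assumptions made for this example and are not part of the Pin API or of the original slides.

```cpp
#include <cstdint>

// Hypothetical cost model in arbitrary "work units"; the fields mirror
// the terms on the slide and are not taken from Pin itself.
struct AnalysisCost {
    std::uint64_t calls;            // frequency of calling the analysis routine
    std::uint64_t transition_work;  // work for transiting to the routine, per call
    std::uint64_t body_work;        // work done inside the routine, per call
};

// Analysis overhead = frequency x (transition work + work inside the routine).
std::uint64_t AnalysisOverhead(const AnalysisCost& c) {
    return c.calls * (c.transition_work + c.body_work);
}

// Pintool overhead = instrumentation routines overhead + analysis routines overhead.
std::uint64_t PintoolOverhead(std::uint64_t instrumentation_work,
                              const AnalysisCost& c) {
    return instrumentation_work + AnalysisOverhead(c);
}
```

For example, PintoolOverhead(300, AnalysisCost{1000, 20, 5}) evaluates to 300 + 1000 x 25. The model has three independent levers: fewer calls, cheaper transitions, and less work per call. The remaining slides address each one in turn.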

Reducing Frequency of Calling Analysis Routines
Key:
- Instrument at the largest granularity whenever possible: Trace > Basic Block > Instruction

Slower Instruction Counting
counter++; is executed before every instruction:
    sub $0xff, %edx
    cmp %esi, %edx
    jle
    mov $0x1, %edi
    add $0x10, %eax

Faster Instruction Counting
Counting at BBL level:
    counter += 3
    sub $0xff, %edx
    cmp %esi, %edx
    jle
    counter += 2
    mov $0x1, %edi
    add $0x10, %eax
Counting at Trace level:
    counter += 5
    sub $0xff, %edx
    cmp %esi, %edx
    jle                 (on the exit to L1: counter -= 2)
    mov $0x1, %edi
    add $0x10, %eax
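The three counting schemes can be compared with a small stand-alone simulation. This is a sketch that does not use Pin: the function names, the CountResult struct, and the encoding of a run as a vector of booleans (true means the jle leaves the trace after the first basic block) are all assumptions made for illustration.

```cpp
#include <cstdint>
#include <vector>

struct CountResult {
    std::uint64_t counter = 0;         // instructions counted
    std::uint64_t analysis_calls = 0;  // how often counting code ran
};

constexpr int kBbl1 = 3;  // sub, cmp, jle
constexpr int kBbl2 = 2;  // mov, add

// One counter update before every executed instruction.
CountResult CountPerInstruction(const std::vector<bool>& taken) {
    CountResult r;
    for (bool t : taken) {
        int executed = t ? kBbl1 : kBbl1 + kBbl2;
        for (int i = 0; i < executed; ++i) {
            r.counter += 1;
            r.analysis_calls += 1;
        }
    }
    return r;
}

// One counter update per executed basic block.
CountResult CountPerBbl(const std::vector<bool>& taken) {
    CountResult r;
    for (bool t : taken) {
        r.counter += kBbl1;
        r.analysis_calls += 1;
        if (!t) {  // fall through into the second block
            r.counter += kBbl2;
            r.analysis_calls += 1;
        }
    }
    return r;
}

// One update at trace entry, plus a correction on the early exit.
CountResult CountPerTrace(const std::vector<bool>& taken) {
    CountResult r;
    for (bool t : taken) {
        r.counter += kBbl1 + kBbl2;
        r.analysis_calls += 1;
        if (t) {  // left the trace early: undo the two unexecuted instructions
            r.counter -= kBbl2;
            r.analysis_calls += 1;
        }
    }
    return r;
}
```

All three functions return the same counter for the same input; only analysis_calls differs. Trace-level counting wins when the early exit is rare, because the correction runs only on that path.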

Reducing Work Done in Analysis Routines
Key:
- Shift computation from Analysis Routines to Instrumentation Routines whenever possible

Edge Counting: a Slower Version
    …
    // Analysis routine: looks up the edge counter every time a branch executes.
    void docount2(ADDRINT src, ADDRINT dst, INT32 taken) {
        COUNTER *pedg = Lookup(src, dst);
        pedg->count += taken;
    }
    // Instrumentation routine: every branch or call gets the same generic call.
    void Instruction(INS ins, void *v) {
        if (INS_IsBranchOrCall(ins)) {
            INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR) docount2,
                           IARG_INST_PTR, IARG_BRANCH_TARGET_ADDR,
                           IARG_BRANCH_TAKEN, IARG_END);
        }
    }
    …

Edge Counting: a Faster Version
    // Analysis routine for direct branches: the counter was already looked up
    // at instrumentation time, so only the increment remains.
    void docount(COUNTER *pedg, INT32 taken) {
        pedg->count += taken;
    }
    // Analysis routine for indirect branches: the target is known only at run time.
    void docount2(ADDRINT src, ADDRINT dst, INT32 taken) {
        COUNTER *pedg = Lookup(src, dst);
        pedg->count += taken;
    }
    void Instruction(INS ins, void *v) {
        if (INS_IsDirectBranchOrCall(ins)) {
            // Source and target are both known here, so do the lookup once.
            COUNTER *pedg = Lookup(INS_Address(ins),
                                   INS_DirectBranchOrCallTargetAddress(ins));
            INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR) docount,
                           IARG_ADDRINT, pedg, IARG_BRANCH_TAKEN, IARG_END);
        } else if (INS_IsBranchOrCall(ins)) {
            // Same guard as in the slower version: only branches and calls
            // have a branch target.
            INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR) docount2,
                           IARG_INST_PTR, IARG_BRANCH_TARGET_ADDR,
                           IARG_BRANCH_TAKEN, IARG_END);
        }
    }
    …
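The effect of moving the lookup from the analysis routine to the instrumentation routine can be seen without Pin. The sketch below is a stand-alone simulation: EdgeTable, RunSlow, and RunFast are hypothetical names, and a std::map stands in for whatever table Lookup() uses in the slides, which the source does not show.

```cpp
#include <cstdint>
#include <map>
#include <utility>

using Addr = std::uint64_t;

struct Counter {
    std::uint64_t count = 0;
};

// Hypothetical edge table; it also records how many lookups were performed.
class EdgeTable {
public:
    Counter* Lookup(Addr src, Addr dst) {
        ++lookups_;
        // std::map never relocates its elements, so the pointer stays valid.
        return &table_[{src, dst}];
    }
    std::uint64_t lookups() const { return lookups_; }

private:
    std::map<std::pair<Addr, Addr>, Counter> table_;
    std::uint64_t lookups_ = 0;
};

// Slower shape: the lookup runs on every execution of the branch.
void RunSlow(EdgeTable& table, Addr src, Addr dst, int executions) {
    for (int i = 0; i < executions; ++i) {
        Counter* edge = table.Lookup(src, dst);
        edge->count += 1;
    }
}

// Faster shape: the lookup runs once, as it would at instrumentation time,
// and each execution only increments through the saved pointer.
void RunFast(EdgeTable& table, Addr src, Addr dst, int executions) {
    Counter* edge = table.Lookup(src, dst);
    for (int i = 0; i < executions; ++i) {
        edge->count += 1;
    }
}
```

Running both for 1000 executions of the same edge gives the same count, with 1000 lookups in the slow shape and 1 in the fast shape. This works only for direct branches, whose target is fixed when the instruction is instrumented.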

Reducing Work Required for Transiting to Analysis Routines
Key:
- Help Pin apply its optimizations to your analysis routines:
  - Inlining
  - Scheduling

Inlining (1)
Embed analysis routines directly into the application
- Avoid transiting through “bridges”
Current limitation:
- Only straight-line code can be inlined
Future Pin version will inline analysis routines with control-flow changes

Inlining (2)
    // Inlinable: straight-line code.
    int docount0(int i) {
        x[i]++;
        return x[i];
    }
    // Not-inlinable: contains a conditional.
    int docount1(int i) {
        if (i == 1000)
            x[i]++;
        return x[i];
    }
    // Not-inlinable: calls another function.
    int docount2(int i) {
        x[i]++;
        printf("%d", i);
        return x[i];
    }
    // Not-inlinable: contains a loop.
    void docount3() {
        for (int i = 0; i < 100; i++)
            x[i]++;
    }

Conditional Inlining
Inline a common scenario where the analysis routine has a single “if-then”
- The “if” part is always executed
- The “then” part is rarely executed
Pintool writer breaks such an analysis routine into two:
- INS_InsertIfCall(ins, …, (AFUNPTR)doif, …)
- INS_InsertThenCall(ins, …, (AFUNPTR)dothen, …)

IP-Sampling (a Slower Version)
    const INT32 N = 10000;
    const INT32 M = 5000;
    INT32 icount = N;
    // Analysis routine: the countdown and the rare print live in one routine,
    // so the "if" keeps it from being inlined.
    VOID IpSample(VOID* ip) {
        --icount;
        if (icount == 0) {
            fprintf(trace, "%p\n", ip);
            icount = N + rand()%M;  // icount is between N and N+M-1
        }
    }
    VOID Instruction(INS ins, VOID *v) {
        INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)IpSample,
                       IARG_INST_PTR, IARG_END);
    }

IP-Sampling (a Faster Version)
    // Inlined: straight-line code.
    INT32 CountDown() {
        --icount;
        return (icount == 0);
    }
    // Not inlined: runs only when a sample is taken.
    VOID PrintIp(VOID *ip) {
        fprintf(trace, "%p\n", ip);
        icount = N + rand()%M;  // icount is between N and N+M-1
    }
    VOID Instruction(INS ins, VOID *v) {
        // CountDown() is always called before an inst is executed
        INS_InsertIfCall(ins, IPOINT_BEFORE, (AFUNPTR)CountDown, IARG_END);
        // PrintIp() is called only if the last call to CountDown() returns
        // a non-zero value
        INS_InsertThenCall(ins, IPOINT_BEFORE, (AFUNPTR)PrintIp,
                           IARG_INST_PTR, IARG_END);
    }
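The if/then split can be mimicked in plain C++ to show how rarely the expensive half runs. This is a sketch under stated assumptions: Sampler and Execute are hypothetical names, the Execute loop stands in for what Pin does with the INS_InsertIfCall and INS_InsertThenCall pair, and a fixed sampling period replaces N + rand()%M so that the behaviour is deterministic.

```cpp
#include <cstdint>
#include <vector>

class Sampler {
public:
    explicit Sampler(std::int32_t period) : period_(period), icount_(period) {}

    // "If" part: cheap, straight-line, runs before every instruction.
    bool CountDown() {
        ++if_calls_;
        --icount_;
        return icount_ == 0;
    }

    // "Then" part: the expensive work, run only when CountDown() returned true.
    void RecordIp(std::uint64_t ip) {
        samples_.push_back(ip);
        icount_ = period_;  // fixed period here; the slides use N + rand()%M
    }

    std::uint64_t if_calls() const { return if_calls_; }
    const std::vector<std::uint64_t>& samples() const { return samples_; }

private:
    std::int32_t period_;
    std::int32_t icount_;
    std::uint64_t if_calls_ = 0;
    std::vector<std::uint64_t> samples_;
};

// Stand-in for the instrumented program: n instructions at consecutive
// (made-up) addresses, each preceded by the if/then pair.
void Execute(Sampler& sampler, std::uint64_t first_ip, int n) {
    for (int i = 0; i < n; ++i) {
        std::uint64_t ip = first_ip + static_cast<std::uint64_t>(i);
        if (sampler.CountDown()) {
            sampler.RecordIp(ip);
        }
    }
}
```

With a period of 100 and 1000 instructions, CountDown() runs 1000 times and RecordIp() runs 10 times. That ratio is why keeping the “if” part small enough to be inlined pays off.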

Scheduling of Instrumentation
If an instrumentation can be inserted anywhere in a basic block:
- Let Pin know via IPOINT_ANYWHERE
- Pin will find the best point to insert the instrumentation to minimize register spilling

ManualExamples/inscount2.C
    #include <stdio.h>
    #include "pin.H"

    UINT64 icount = 0;

    // Analysis routine: adds the number of instructions in the basic block.
    void docount(INT32 c) { icount += c; }

    // Instrumentation routine: one call per basic block, placed wherever
    // Pin finds it cheapest (IPOINT_ANYWHERE).
    void Trace(TRACE trace, void *v) {
        for (BBL bbl = TRACE_BblHead(trace); BBL_Valid(bbl); bbl = BBL_Next(bbl)) {
            BBL_InsertCall(bbl, IPOINT_ANYWHERE, (AFUNPTR)docount,
                           IARG_UINT32, BBL_NumIns(bbl), IARG_END);
        }
    }

    void Fini(INT32 code, void *v) {
        fprintf(stderr, "Count %llu\n", (unsigned long long)icount);
    }

    int main(int argc, char *argv[]) {
        PIN_Init(argc, argv);
        TRACE_AddInstrumentFunction(Trace, 0);
        PIN_AddFiniFunction(Fini, 0);
        PIN_StartProgram();  // never returns
        return 0;
    }

Performance Comparison (Pin vs. Other Popular Tools)
Runtime overhead of basic-block counting with 3 tools:
- Valgrind* is a popular instrumentation tool on Linux*
- DynamoRIO* is a dynamic optimization system developed at MIT
Pin is 3.3x faster than Valgrind* and 2x faster than DynamoRIO*!
Performance of Pin can be further improved by detaching and running natively after sufficient profiling is done!
(Disclaimer: mileage may vary)