1 Software Instrumentation and Hardware Profiling for Intel® Itanium® Linux* CGO’04 Tutorial 3/21/04 Robert Cohn, Intel Stéphane Eranian, HP CK Luk, Intel.

Presentation transcript:

1 Software Instrumentation and Hardware Profiling for Intel® Itanium® Linux*
CGO’04 Tutorial 3/21/04
Robert Cohn, Intel
Stéphane Eranian, HP
CK Luk, Intel
Vijay Janapa Reddi, University of Colorado
* Other names and brands may be claimed as the property of others

2 Agenda
Instrumentation – Robert, Vijay
Itanium Performance Monitoring Unit – CK
Break
Linux/IA64 Support for Performance Monitoring – Stéphane
Guiding Ispike with Instrumentation and Hardware (PMU) Profiles – CK

3 Instrumentation of Intel® Itanium® Linux* Programs with Pin
download:
Robert Cohn, Intel

4 What Does Pin Stand For?
Pin Is Not an acronym
Pin is based on the post-link optimizer Spike
 –Uses dynamic code generation to make a less intrusive profile-guided optimization and instrumentation system
 –Pin is a small Spike

5 Instrumentation
    max = 0;
    for (p = head; p; p = p->next) {
        count[0]++;              // instrumentation
        printf("In Loop\n");     // instrumentation
        if (p->value > max) {
            count[1]++;          // instrumentation
            printf("In max\n");  // instrumentation
            max = p->value;
        }
    }
Instrumentation is: User defined, Dynamic

6 What Can You Do With Instrumentation?
Profiler for optimization:
 –Basic block count
 –Value profile
Microarchitectural study
 –Instrument branches to simulate branch predictor
 –Generate traces
Bug checking
 –Find references to uninitialized, unallocated data
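As an illustration of "instrument branches to simulate branch predictor", here is a minimal sketch of the model such a tool could drive from its branch analysis routine. It is not Pin code: `TwoBitPredictor`, `OnBranch`, and `SimulateLoopBranch` are hypothetical names, the table indexing is arbitrary, and the 2-bit saturating counter is just one common textbook scheme.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative model only (not part of Pin): a table of 2-bit saturating
// counters. An analysis routine attached to each branch would call OnBranch
// with the branch address and its taken/not-taken outcome.
class TwoBitPredictor {
public:
    explicit TwoBitPredictor(std::size_t entries)
        : counters_(entries, 2) {  // every counter starts at 2 = weakly taken
        assert(entries > 0);
    }

    // Predict, compare with the real outcome, then train the counter.
    void OnBranch(std::uint64_t address, bool taken) {
        std::uint8_t& counter = counters_[(address >> 4) % counters_.size()];
        const bool predictedTaken = counter >= 2;
        if (predictedTaken != taken) ++mispredictions_;
        ++branches_;
        if (taken && counter < 3) ++counter;
        if (!taken && counter > 0) --counter;
    }

    std::uint64_t branches() const { return branches_; }
    std::uint64_t mispredictions() const { return mispredictions_; }

private:
    std::vector<std::uint8_t> counters_;
    std::uint64_t branches_ = 0;
    std::uint64_t mispredictions_ = 0;
};

// Feed the predictor a loop-closing branch: taken (tripCount - 1) times,
// then not taken once, for the given number of loop executions.
std::uint64_t SimulateLoopBranch(TwoBitPredictor& predictor,
                                 std::uint64_t address,
                                 int tripCount, int repetitions) {
    for (int r = 0; r < repetitions; ++r)
        for (int i = 0; i < tripCount; ++i)
            predictor.OnBranch(address, i + 1 < tripCount);
    return predictor.mispredictions();
}
```

In a real tool the outcome would come from an argument such as IARG_BRANCH_TAKEN; here the outcomes are generated synthetically so the model can be exercised on its own.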

7 Classification of Instrumentation Tools
Source
 –src to src: CIL
 –compiler: cc
Binary
 –static: Atom
 –dynamic
   JIT: Pin
   Editing: Dyninst
See last slide for references and more examples

8 JIT-Based: Execution Drives Instrumentation
[Figure: Original code, Compiler, Code cache; blocks in the code cache are labeled with primes (1’, 2’)]

9 Execution Drives Instrumentation
[Figure: Original code, Compiler, Code cache; blocks in the code cache are labeled with primes (1’, 2’, 3’, 5’, 6’)]

10 Inserting Instrumentation
Relative to an instruction:
 1. Before
 2. After
 3. Taken edge of branch
Example code:
        mov r4 = 2
        (p1) br.cond L2
        add r3=8,r9
    L2: mov r9 = 4
        br.ret
Instrumentation callouts in the figure: count(3), count(100), count(105)

11 Analysis Routines
Instead of inserting IPF instructions, user inserts calls to analysis routine
 –User specified arguments
 –E.g. increment counter, record memory address, …
Written in C, ASM, etc.
Optimizations like inlining, register allocation, and scheduling make it efficient

12 Instrumentation Routines
Instrumentation routine walks list of instructions, and inserts calls to analysis routines
User writes instrumentation routine
Pin invokes instrumentation routine when placing new instructions in code cache
Repeated execution uses already instrumented code in code cache
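The split between instrumentation routines (run when code enters the code cache) and analysis routines (run on every execution) can be sketched with a toy model. Nothing here is the Pin API: `ToyJit`, `Instrument`, `Execute`, and `RunLoop` are hypothetical names, and the "code cache" is just a map from address to the analysis calls attached to it.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct ToyJit;
using AnalysisFn = void (*)(ToyJit&);

// Toy model (not Pin): instrument an instruction the first time it is
// executed, then reuse the instrumented copy from the code cache.
struct ToyJit {
    // "Code cache": address -> analysis calls attached at instrumentation time.
    std::unordered_map<std::uint64_t, std::vector<AnalysisFn>> codeCache;
    std::uint64_t instrumentationCalls = 0;
    std::uint64_t analysisCalls = 0;

    // Analysis routine: runs every time the instruction executes.
    static void CountOne(ToyJit& jit) { ++jit.analysisCalls; }

    // Instrumentation routine: runs once per instruction and decides
    // which analysis calls to attach.
    std::vector<AnalysisFn> Instrument(std::uint64_t /*address*/) {
        ++instrumentationCalls;
        return {&ToyJit::CountOne};
    }

    void Execute(std::uint64_t address) {
        auto it = codeCache.find(address);
        if (it == codeCache.end())  // first execution: instrument and cache
            it = codeCache.emplace(address, Instrument(address)).first;
        for (AnalysisFn call : it->second) call(*this);
    }
};

// Execute a three-instruction loop body the given number of times.
ToyJit RunLoop(int iterations) {
    assert(iterations >= 0);
    ToyJit jit;
    const std::uint64_t body[] = {0x4000, 0x4001, 0x4002};
    for (int i = 0; i < iterations; ++i)
        for (std::uint64_t address : body) jit.Execute(address);
    return jit;
}
```

Running the loop 100 times calls the instrumentation routine 3 times and the analysis routine 300 times, which is why work moved into the instrumentation routine is close to free.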

13 Example: Instruction Count
    Tests]$ hello
    Hello world
    Tests]$ icount -- hello
    Hello world
    ICount
    Tests]$

14 Example: Instruction Count
    mov r2 = 2
    add r3 = 4, r3
    (p2) br.cond L1
    add r4 = 8, r4
    br.cond L2
Inserted instrumentation: counter++;

15
    #include <stdio.h>
    #include "pinstr.H"

    UINT64 icount = 0;

    // Analysis Routine: runs before every executed instruction
    void docount() { icount++; }

    // Instrumentation Routine: inserts a call to docount before each instruction
    void Instruction(INS ins) {
        PIN_InsertCall(IPOINT_BEFORE, ins, (AFUNPTR)docount, IARG_END);
    }

    // Called when the program finishes: report the count
    VOID Fini() {
        fprintf(stderr, "ICount %llu\n", (unsigned long long)icount);
    }

    int main(int argc, char *argv[]) {
        PIN_AddInstrumentInstructionFunction(Instruction);
        PIN_AddFiniFunction(Fini);
        PIN_StartProgram();
    }

16 Example: Instruction Trace
    Trace]$ itrace -- hello
    Hello world
    Trace]$ head prog.trace
    0x c0
    0x c1
    0x c2
    0x d0
    0x d2
    0x e0
    0x e1
    0x e2
    Trace]$

17 Example: Instruction Trace
    mov r2 = 2
    add r3 = 4, r3
    (p2) br.cond L1
    add r4 = 8, r4
    br.cond L2
Inserted instrumentation: traceInst(ip);

18
    #include <stdio.h>
    #include "pinstr.H"

    FILE *traceFile;

    // Analysis Routine: write the instruction address (with slot) to the trace
    void traceInst(long *ipsyll) {
        fprintf(traceFile, "%p\n", ipsyll);
    }

    // Instrumentation Routine: pass the IP and slot of each instruction
    void Instruction(INS ins) {
        PIN_InsertCall(IPOINT_BEFORE, ins, (AFUNPTR)traceInst,
                       IARG_IP_SLOT, IARG_END);
    }

    int main(int argc, char *argv[]) {
        PIN_AddInstrumentInstructionFunction(Instruction);
        traceFile = fopen("prog.trace", "w");
        PIN_StartProgram();
    }

19 Arguments to Analysis Routine
IARG_UINT8, …, IARG_UINT64
IARG_REG_VALUE
IARG_IP_SLOT
IARG_BRANCH_TAKEN
IARG_BRANCH_TARGET_ADDRESS
IARG_THREAD_ID
IARG_IN_SIGNAL

20 More Advanced Tools
Instruction cache simulation: replace itrace analysis function
Data cache: like icache, but instrument loads/stores and pass effective address
Malloc/Free trace: instrument entry/exit points
Detect out of bound stack references
 –Instrument instructions that move stack pointer
 –Instrument loads/stores to check in bound
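To make "replace itrace analysis function" concrete, the sketch below shows the kind of model that could stand in for the body of traceInst: instead of printing each address, it looks the address up in a simulated cache. The class name, the direct-mapped organization, and the sizes are illustrative assumptions, not anything taken from the tutorial or from Pin.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical direct-mapped instruction cache model (not part of Pin).
class DirectMappedCache {
public:
    DirectMappedCache(std::size_t numLines, std::uint64_t lineBytes)
        : lineBytes_(lineBytes), tags_(numLines, 0), valid_(numLines, false) {
        assert(numLines > 0 && lineBytes > 0);
    }

    // Body of the analysis routine: call once per fetched address.
    void Access(std::uint64_t address) {
        const std::uint64_t block = address / lineBytes_;
        const std::size_t index = static_cast<std::size_t>(block % tags_.size());
        if (valid_[index] && tags_[index] == block) {
            ++hits_;
        } else {
            ++misses_;            // miss: fill the line, evicting any old block
            valid_[index] = true;
            tags_[index] = block;
        }
    }

    std::uint64_t hits() const { return hits_; }
    std::uint64_t misses() const { return misses_; }

private:
    std::uint64_t lineBytes_;
    std::vector<std::uint64_t> tags_;  // block number stored per line
    std::vector<bool> valid_;
    std::uint64_t hits_ = 0;
    std::uint64_t misses_ = 0;
};
```

A data cache tool would use the same model but feed it the effective addresses of loads and stores rather than instruction addresses.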

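21 Example: Faster Instruction Count
    mov r2 = 2
    add r3 = 4, r3
    (p2) br.cond L1
    add r4 = 8, r4
    br.cond L2
Instead of counter++ before every instruction, update the counter once per block:
 –counter += 3; (mov, add, br.cond L1)
 –counter += 2; (add, br.cond L2)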

22 Sequences
List of instructions that is only entered from top
Program:
        mov r2 = 2
    L2: add r3 = 4, r3
        add r4 = 8, r4
        br.cond L2
Sequence 1:
        mov r2 = 2
        add r3 = 4, r3
        add r4 = 8, r4
        br.cond L2
Sequence 2:
        add r3 = 4, r3
        add r4 = 8, r4
        br.cond L2

23
    // Analysis Routine: add the size of the block that is about to complete
    void docount(UINT64 c) { icount += c; }

    // Instrumentation Routine: one counter update per branch-delimited block
    void Sequence(INS head) {
        INS ins;
        INS last = INS_INVALID();
        UINT64 count = 0;
        for (ins = head; ins != INS_INVALID(); ins = INS_Next(ins)) {
            count++;
            switch (INS_Category(ins)) {
            case TYPE_CAT_BRANCH:
            case TYPE_CAT_CBRANCH:
            case TYPE_CAT_JUMP:
            case TYPE_CAT_CJUMP:
            case TYPE_CAT_CHECK:
            case TYPE_CAT_BREAK:
                // Control may leave here: account for instructions so far
                PIN_InsertCall(IPOINT_BEFORE, ins, (AFUNPTR)docount,
                               IARG_UINT64, count, IARG_END);
                count = 0;
                break;
            }
            last = ins;
        }
        // Account for any instructions after the last branch
        PIN_InsertCall(IPOINT_AFTER, last, (AFUNPTR)docount,
                       IARG_UINT64, count, IARG_END);
    }
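The saving from counting once per branch-delimited block, rather than once per instruction, can be illustrated with a small stand-alone model. This is a sketch under simplifying assumptions: `Block`, `Counts`, and both functions are hypothetical, and the executed path is given as a list of block indices rather than produced by a real program.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Block { int numInstructions; };

struct Counts {
    std::uint64_t instructions = 0;   // value the tool reports
    std::uint64_t analysisCalls = 0;  // how many analysis calls it cost
};

// One analysis call (counter++) before every instruction.
Counts CountPerInstruction(const std::vector<Block>& blocks,
                           const std::vector<std::size_t>& path) {
    Counts c;
    for (std::size_t b : path) {
        assert(b < blocks.size());
        for (int i = 0; i < blocks[b].numInstructions; ++i) {
            ++c.instructions;
            ++c.analysisCalls;
        }
    }
    return c;
}

// One analysis call (counter += n) per executed block.
Counts CountPerBlock(const std::vector<Block>& blocks,
                     const std::vector<std::size_t>& path) {
    Counts c;
    for (std::size_t b : path) {
        assert(b < blocks.size());
        c.instructions += static_cast<std::uint64_t>(blocks[b].numInstructions);
        ++c.analysisCalls;
    }
    return c;
}
```

With the two blocks from slide 21 (3 and 2 instructions), both schemes report the same instruction count, but the per-block scheme makes one call per block instead of one per instruction.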

24 Instruction Information Accessed at Instrumentation Time
 1. INS_Category(INS)
 2. INS_Address(INS)
 3. INS_Regr1, INS_Regr2, INS_Regr3, …
 4. INS_Next(INS), INS_Prev(INS)
 5. INS_BraType(INS)
 6. INS_SizeType(INS)
 7. INS_Stop(INS)

25 Callbacks
Callbacks for instrumentation
 –PIN_AddInstrumentInstructionFunction
 –PIN_AddInstrumentSequenceFunction
Other callbacks
 –PIN_AddImageLoadFunction
 –PIN_AddImageUnloadFunction
 –PIN_AddThreadBeforeFunction
 –PIN_AddThreadAfterFunction
 –PIN_AddFiniFunction: last thread exits

26 Instrumentation is Transparent
When application looks at itself, sees same:
 –Code addresses
 –Data addresses
 –Memory contents
 → Don’t want to change behavior, expose latent bugs
When instrumentation looks at application, sees original application:
 –Code addresses
 –Data addresses
 –Memory contents
 → Observe original behavior

27 Pin Instruments All Code
Execution driven instrumentation:
 –Shared libraries
 –Dynamically generated code
Self modifying code
 –Instrumented first time executed
 –Pin does not detect code has been modified

28 Dynamic Instrumentation
While program is running:
 –Instrumentation can be turned on/off
 –Code cache can be invalidated
 –Reinstrumented the next time it is executed
 –Pin can detach and run application native
Use this for fast skip

29 Making Instrumentation Fast
Use sequences to reduce analysis calls
Do work at instrumentation-time
 –Instrumentation functions executed once
 –Analysis function executed every time instruction is executed

    PIN_InsertCall(IPOINT_BEFORE, ins, Fun, Lookup(INS_Address(ins)), IARG_END);
    Fun(BUCKET *) { …

Better than:

    PIN_InsertCall(IPOINT_BEFORE, ins, Fun, IARG_IP, IARG_END);
    Fun(UINT64 ip) { Lookup(ip); …
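The "do work at instrumentation-time" advice can be checked with a small model that counts how often the expensive lookup runs under each scheme. This is an illustrative sketch, not Pin code: `Profile`, `Bucket`, `RunLookupEveryTime`, and `RunLookupOnce` are hypothetical, and bucketing addresses by 16-byte bundle is just an assumption made for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

struct Bucket { std::uint64_t count = 0; };

class Profile {
public:
    // The "expensive" step: find (or create) the bucket for an address.
    // std::map keeps element addresses stable, so the pointer stays valid.
    Bucket* Lookup(std::uint64_t address) {
        ++lookups_;
        return &buckets_[address / 16];
    }
    std::uint64_t lookups() const { return lookups_; }
    std::uint64_t CountFor(std::uint64_t address) const {
        auto it = buckets_.find(address / 16);
        return it == buckets_.end() ? 0 : it->second.count;
    }

private:
    std::map<std::uint64_t, Bucket> buckets_;
    std::uint64_t lookups_ = 0;
};

// Slower scheme: pass the IP and look it up on every execution.
std::uint64_t RunLookupEveryTime(Profile& profile,
                                 const std::vector<std::uint64_t>& body,
                                 int iterations) {
    assert(iterations >= 0);
    for (int i = 0; i < iterations; ++i)
        for (std::uint64_t ip : body) ++profile.Lookup(ip)->count;
    return profile.lookups();
}

// Faster scheme: look up once per instruction at instrumentation time and
// hand the resulting bucket pointer to the analysis routine.
std::uint64_t RunLookupOnce(Profile& profile,
                            const std::vector<std::uint64_t>& body,
                            int iterations) {
    assert(iterations >= 0);
    std::vector<Bucket*> attached;
    for (std::uint64_t ip : body) attached.push_back(profile.Lookup(ip));
    for (int i = 0; i < iterations; ++i)
        for (Bucket* bucket : attached) ++bucket->count;
    return profile.lookups();
}
```

Both schemes produce identical bucket counts; only the number of lookups differs (one per execution versus one per static instruction).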

30 Future
Support for other ISAs
 –x86
 –ARM
Attach to running process

31 Advanced Topics
Symbol table
Altering program behavior
Threads
Debugging
Jitting

32 Symbol Table/Image
Query:
 –Address → symbol name
 –Address → image name (e.g. libc.so)
 –Address → source file, line number
Instrumentation:
 –Procedure before/after
    PIN_InsertCall(IPOINT_BEFORE, Sym, Afun, IARG_REG_VALUE, REG_REG_GP_ARG0, IARG_END)
   Before: at entry point
   After: immediately before return is executed
 –Catch image load/unload

33 Alter Program Behavior
Analysis routines can write application memory
Replace one procedure with another
 –E.g. replace library calls
 –PIN_ReplaceProcedureByAddress(address, funptr);
 –Replaces function at address with funptr

34 Alter Program Behavior
Change values of registers:
 –PIN_InsertCall(IPOINT_BEFORE, ins, zap, IARG_RETURN_VALUES, IARG_REG_VALUE, REG_G03, IARG_END);
 –Return value of function zap is written to r3

35 Alter Program Behavior
Change instructions:
    ld8 r3=[r4]
Becomes:
    ld8 r3=[r9]
 –INS_RegR1Set(ins, REG_G09)
Pin provides virtual registers for instrumentation:
 –REG_INST_GP0 – REG_INST_GP9

36 Threads
Pin is thread safe
Pin assumes your tool is not thread safe
 –Tell Pin how many threads your tool can handle
 –PIN_SetMaxThreads(INT)
Make your tool thread safe
 –Instrumentation code: guarded by single lock
 –Analysis code: protect global data structures
Lock
 –Use Pin provided routines, not pthreads
Thread local storage
 –Use IARG_THREAD_ID to index into array
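The "use IARG_THREAD_ID to index into array" idea amounts to giving each thread its own slot, so the analysis routine never writes data owned by another thread. The sketch below shows only the data layout; it is not Pin code and deliberately uses no real threads. `PerThreadCounts`, `Replay`, and `kMaxThreads` are hypothetical, and the thread ID is passed in explicitly where a real tool would receive it as an analysis argument.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

constexpr std::size_t kMaxThreads = 4;  // hypothetical limit the tool declares

struct PerThreadCounts {
    std::array<std::uint64_t, kMaxThreads> icount{};  // one slot per thread

    // Analysis-routine body: each thread touches only its own slot,
    // so no lock is needed for the increment.
    void docount(std::size_t threadId) {
        assert(threadId < kMaxThreads);
        ++icount[threadId];
    }

    // Combine the slots once, e.g. when the last thread exits.
    std::uint64_t Total() const {
        return std::accumulate(icount.begin(), icount.end(), std::uint64_t{0});
    }
};

// Replay an interleaved schedule: each entry is the ID of the thread that
// executes the next instruction.
PerThreadCounts Replay(const std::vector<std::size_t>& schedule) {
    PerThreadCounts counts;
    for (std::size_t threadId : schedule) counts.docount(threadId);
    return counts;
}
```

In a truly multithreaded tool, adjacent slots may share a cache line; padding each slot to a cache-line boundary is a common refinement that this sketch leaves out.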

37 Debugging
Instrumentation code
 –Use gdb
Analysis code
 –Pin dynamically optimizes analysis code to reduce cost of instrumentation (up to 10x)
 –Disable optimization to use gdb:
    icount -p pin -O0 -- /bin/ls
 –Otherwise, use printf

38 Jitting
Fetch
Instrument
Translate
Stub
Allocate registers
Generate code

39 Fetch
[Figure: Original code, Code cache; blocks in the code cache are labeled with primes (1’, 2’)]

40 Instrument
[Figure: code listing annotated with three regions (Application, Bridge, Instrumentation) and the steps: Save scratch, Setup call frame, Setup arguments, Call, Undo call frame, Restore scratch, return]
Code shown in the figure:
    ld8 r36=[r14]
    mov r32=r1
    nop.m 0x0
    mov r38 = r2
    br.call b0=print
    alloc r34=ar.pfs,6,4,0
    mov r35=r12
    adds r2=-16,r1
    …..
    br.ret.sptk.many b0;;

41 Translate
Jitting does not change architectural behavior
 –Registers/memory
 –Timing is not architectural
Change instructions to preserve old behavior

42 Translate
New code is not at same address
    0x4000: mov b1 = ip
    → movl rscr1 = 0x4000; mov b1 = rscr1
Branches always target code cache
    br.cond 0x4010
    → br.cond 0x5030
New code is not at same address / branches always target code cache
    0x4000: br.call.many b0 = 0x4100
    → movl rscr1 = 0x4010; mov b0 = rscr1; br.call bscr1 = 0x5010
Indirect targets resolved at runtime
    br.ret b0
    → movl rscr1 = 0x4010; mov rscr2 = b0; cmpne p1,p0=rscr1, rscr2; br.cond (p1)
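The address bookkeeping behind these rewrites can be sketched as a map from original addresses to code cache addresses. This is a simplified model, not Pin's implementation: `CodeCacheMap` and its methods are hypothetical, and a general lookup replaces the inline compare-and-branch sequence shown on the slide. The addresses reuse the slide's own examples (0x4010 → 0x5030, 0x4100 → 0x5010).

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Toy model of the original-address -> code-cache-address mapping.
class CodeCacheMap {
public:
    void Add(std::uint64_t original, std::uint64_t cached) {
        map_[original] = cached;
    }

    // Translate time: a direct branch target is a constant, so it is rewritten
    // once. In this toy model the target must already be translated; a real
    // JIT would emit a stub for targets that are not yet in the cache.
    std::uint64_t RewriteDirectTarget(std::uint64_t originalTarget) const {
        auto it = map_.find(originalTarget);
        assert(it != map_.end());
        return it->second;
    }

    // Run time: registers such as b0 still hold ORIGINAL addresses (that is
    // what keeps the translation transparent), so an indirect branch must be
    // resolved when it executes. Returns false if the target is not cached.
    bool ResolveIndirectTarget(std::uint64_t originalTarget,
                               std::uint64_t* cachedTarget) const {
        auto it = map_.find(originalTarget);
        if (it == map_.end()) return false;
        *cachedTarget = it->second;
        return true;
    }

private:
    std::unordered_map<std::uint64_t, std::uint64_t> map_;
};
```

The design point is that the application only ever observes original addresses, while control flow only ever lands in the code cache.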

43 Stub
[Figure: Original code, Compiler, Code cache; blocks in the code cache are labeled with primes (1’, 2’, 3’, 5’, 6’), plus a stub]

44 Instrumentation/Hardware Counters
Instrumentation is better when:
 –Exact profile needed
   Code coverage must distinguish 0/1 execution
 –No hardware to measure event
   Modeled cache does not match current hardware
   Store followed by load to the same address
But instrumentation may:
 –Be too slow compared to sampling/counting
 –Change the behavior of program by slowing it down
 –Not be able to observe microarchitectural events
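The point that code coverage "must distinguish 0/1 execution" can be illustrated by comparing exact counts with periodic sampling on the same trace. This is a deterministic toy model, not a PMU or Pin interface: the function names are hypothetical, and the trace is simply a list of executed block IDs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Exact profile, as instrumentation would produce: count every execution.
std::vector<std::uint64_t> ExactCounts(const std::vector<std::size_t>& trace,
                                       std::size_t numBlocks) {
    std::vector<std::uint64_t> counts(numBlocks, 0);
    for (std::size_t block : trace) {
        assert(block < numBlocks);
        ++counts[block];
    }
    return counts;
}

// Sampled profile: record only every period-th event, roughly what a
// sampling-based profiler observes.
std::vector<std::uint64_t> SampledCounts(const std::vector<std::size_t>& trace,
                                         std::size_t numBlocks,
                                         std::size_t period) {
    assert(period > 0);
    std::vector<std::uint64_t> counts(numBlocks, 0);
    for (std::size_t i = 0; i < trace.size(); ++i) {
        assert(trace[i] < numBlocks);
        if ((i + 1) % period == 0) ++counts[trace[i]];
    }
    return counts;
}
```

A block that runs once in a long trace shows up with count 1 in the exact profile but is easily missed by sampling, so the sampled profile cannot tell "executed once" from "never executed".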

45 More Reading on Instrumentation Systems
Static:
 –Source:
   CIL:
 –Binary:
   Atom: static instrumentation for Alpha. Papers: Manual: ITLE.HTM
   Etch: static instrumentation of x86 Windows
   EEL: machine independent executable editing
   Vulcan: ftp://ftp.research.microsoft.com/pub/tr/tr pdf
Dynamic:
 –JIT based:
   Pin: dynamic instrumentation, IPF Linux
   DynamoRIO: x86 Linux and Windows
   Diota: dynamic instrumentation of x86 Linux
 –Editing based:
   Dyninst: dynamic code generation API for multiple architectures/OS
   Caliper: Itanium HP-UX
   Vulcan: ftp://ftp.research.microsoft.com/pub/tr/tr pdf