SuperPin: Parallelizing Dynamic Instrumentation for Real-Time Performance Steven Wallace and Kim Hazelwood.

Slides:



Advertisements
Similar presentations
Programming Technologies, MIPT, April 7th, 2012 Introduction to Binary Translation Technology Roman Sokolov SMWare
Advertisements

Instrumentation of Linux Programs with Pin Robert Cohn & C-K Luk Platform Technology & Architecture Development Enterprise Platform Group Intel Corporation.
Software & Services Group PinPlay: A Framework for Deterministic Replay and Reproducible Analysis of Parallel Programs Harish Patil, Cristiano Pereira,
Comprehensive Kernel Instrumentation via Dynamic Binary Translation Peter Feiner, Angela Demke Brown, Ashvin Goel University of Toronto Presenter: Chuong.
Overview Motivations Basic static and dynamic optimization methods ADAPT Dynamo.
Processes and Threads Prof. Sirer CS 4410 Cornell University.
Pin : Building Customized Program Analysis Tools with Dynamic Instrumentation Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff.
Helper Threads via Virtual Multithreading on an experimental Itanium 2 processor platform. Perry H Wang et. Al.
Chapter 6 Limited Direct Execution
Pipelined Profiling and Analysis on Multi-core Systems Qin Zhao Ioana Cutcutache Weng-Fai Wong PiPA.
Threads 1 CS502 Spring 2006 Threads CS-502 Spring 2006.
San Diego Supercomputer Center Performance Modeling and Characterization Lab PMaC Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation.
1 School of Computing Science Simon Fraser University CMPT 300: Operating Systems I Ch 4: Threads Dr. Mohamed Hefeeda.
Mehmet Can Vuran, Instructor University of Nebraska-Lincoln Acknowledgement: Overheads adapted from those provided by the authors of the textbook.
University of Colorado
Scheduler Activations Jeff Chase. Threads in a Process Threads are useful at user-level – Parallelism, hide I/O latency, interactivity Option A (early.
Pin2 Tutorial1 Pin Tutorial Kim Hazelwood Robert Muth VSSAD Group, Intel.
Previous Next 06/18/2000Shanghai Jiaotong Univ. Computer Science & Engineering Dept. C+J Software Architecture Shanghai Jiaotong University Author: Lu,
Chapter 4: Threads. 4.2 Silberschatz, Galvin and Gagne ©2005 Operating System Concepts Threads A thread (or lightweight process) is a basic unit of CPU.
Lecture 8. Profiling - for Performance Analysis - Prof. Taeweon Suh Computer Science Education Korea University COM503 Parallel Computer Architecture &
Analyzing parallel programs with Pin Moshe Bach, Mark Charney, Robert Cohn, Elena Demikhovsky, Tevi Devor, Kim Hazelwood, Aamer Jaleel, Chi- Keung Luk,
CSC 501 Lecture 2: Processes. Process Process is a running program a program in execution an “instantiation” of a program Program is a bunch of instructions.
1 Dimension: An Instrumentation Tool for Virtual Execution Environments Jing Yang, Shukang Zhou and Mary Lou Soffa Department of Computer Science University.
Process Virtualization and Symbiotic Optimization Kim Hazelwood ACACES Summer School July 2009.
Operating Systems ECE344 Ashvin Goel ECE University of Toronto Threads and Processes.
CS 390- Unix Programming Environment CS 390 Unix Programming Environment Topics to be covered: Distributed Computing Fundamentals.
PMaC Performance Modeling and Characterization Performance Modeling and Analysis with PEBIL Michael Laurenzano, Ananta Tiwari, Laura Carrington Performance.
- 1 - Copyright © 2006 Intel Corporation. All Rights Reserved. Using the Pin Instrumentation Tool for Computer Architecture Research Aamer Jaleel, Chi-Keung.
Operating Systems ECE344 Ashvin Goel ECE University of Toronto OS-Related Hardware.
1 Instrumentation of Intel® Itanium® Linux* Programs with Pin download: Robert Cohn MMDC Intel * Other names and brands.
Processes and Threads CS550 Operating Systems. Processes and Threads These exist only at execution time They have fast state changes -> in memory and.
Dynamic Compilation and Modification CS 671 April 15, 2008.
© 2004, D. J. Foreman 2-1 Concurrency, Processes and Threads.
Lecture 3 Process Concepts. What is a Process? A process is the dynamic execution context of an executing program. Several processes may run concurrently,
Hardware process When the computer is powered up, it begins to execute fetch-execute cycle for the program that is stored in memory at the boot strap entry.
CS 346 – Chapter 4 Threads –How they differ from processes –Definition, purpose Threads of the same process share: code, data, open files –Types –Support.
Instrumentation in Software Dynamic Translators for Self-Managed Systems Bruce R. Childers Naveen Kumar, Jonathan Misurda and Mary.
Scalable Support for Multithreaded Applications on Dynamic Binary Instrumentation Systems Kim Hazelwood Greg Lueck Robert Cohn.
Source: Operating System Concepts by Silberschatz, Galvin and Gagne.
The Mach System Abraham Silberschatz, Peter Baer Galvin, Greg Gagne Presentation By: Agnimitra Roy.
Day 2: Building Process Virtualization Systems Kim Hazelwood ACACES Summer School July 2009.
Concurrency, Processes, and System calls Benefits and issues of concurrency The basic concept of process System calls.
Full and Para Virtualization
© Janice Regan, CMPT 300, May CMPT 300 Introduction to Operating Systems Operating Systems Processes and Threads.
Part Two: Optimizing Pintools Robert Cohn Kim Hazelwood.
Thread basics. A computer process Every time a program is executed a process is created It is managed via a data structure that keeps all things memory.
Efficient software-based fault isolation Robert Wahbe, Steven Lucco, Thomas Anderson & Susan Graham Presented by: Stelian Coros.
1 Process Description and Control Chapter 3. 2 Process A program in execution An instance of a program running on a computer The entity that can be assigned.
Slides created by: Professor Ian G. Harris Operating Systems  Allow the processor to perform several tasks at virtually the same time Ex. Web Controlled.
Performance Optimization of Pintools C K Luk Copyright © 2006 Intel Corporation. All Rights Reserved. Reducing Instrumentation Overhead Total Overhead.
1 ROGUE Dynamic Optimization Framework Using Pin Vijay Janapa Reddi PhD. Candidate - Electrical And Computer Engineering University of Colorado at Boulder.
CISC2200 Threads Fall 09. Process  We learn the concept of process  A program in execution  A process owns some resources  A process executes a program.
Qin Zhao1, Joon Edward Sim2, WengFai Wong1,2 1SingaporeMIT Alliance 2Department of Computer Science National University of Singapore
CMSC 421 Spring 2004 Section 0202 Part II: Process Management Chapter 5 Threads.
Sung-Dong Kim, Dept. of Computer Engineering, Hansung University Java - Introduction.
Introduction to Operating Systems Concepts
Chapter 4: Threads Modified by Dr. Neerja Mhaskar for CS 3SH3.
Introduction to Kernel
Protection of System Resources
Lecture Topics: 11/1 Processes Process Management
Threads and Data Sharing
Lecture Topics: 11/1 General Operating System Concepts Processes
Processes Hank Levy 1.
Programming with Shared Memory
CSE 451: Operating Systems Autumn 2003 Lecture 10 Paging & TLBs
Chapter 3: Processes.
CSE 451: Operating Systems Autumn 2003 Lecture 10 Paging & TLBs
Processes Creation and Threads
Processes Hank Levy 1.
Dynamic Binary Translators and Instrumenters
Presentation transcript:

SuperPin: Parallelizing Dynamic Instrumentation for Real-Time Performance Steven Wallace and Kim Hazelwood

Hazelwood – CGO Dynamic Binary Instrumentation Inserts user-defined instructions into executing binaries Easily Efficiently Transparently Why? Detect inefficiencies Detect bugs Security checks Add features Examples Valgrind, DynamoRIO, Strata, HDTrans, Pin EXE Transform Code Cache Execute Profile

Hazelwood – CGO Intel Pin A dynamic binary instrumentation system Easy-to-use instrumentation interface Supports multiple platforms –Four ISAs – IA32, Intel64, IPF, ARM –Four OSes – Linux, Windows, FreeBSD, MacOS Robust and stable (Pin can run itself!) –12+ active developers –Nightly testing of binaries on 15 platforms –Large user base in academia and industry –Active mailing list (Pinheads) 11,500 downloads

Hazelwood – CGO Our Goal: Improve Performance The latest Pin overhead numbers …

Hazelwood – CGO Adding Instrumentation

Hazelwood – CGO Sources of Overhead Internal Compiling code & exit stubs (region detection, region formation, code generation) Managing code (eviction, linking) Managing directories and performing lookups Maintaining consistency (SMC, DLLs) External User-inserted instrumentation

Hazelwood – CGO “Normal Pin” Execution Flow Instrumentation is interleaved with application Uninstrumented Application Instrumented Application Pin Overhead Instrumentation Overhead “Pinned” Application time 

Hazelwood – CGO “SuperPin” Execution Flow SuperPin creates instrumented slices Uninstrumented Application SuperPinned Application Instrumented Slices

Hazelwood – CGO Issues and Design Decisions Creating slices How/when to start a slice How/when to end a slice System calls Merging results

Hazelwood – CGO fork S6+ fork S5+ fork S4+ fork record sigr3, sleep S3+ fork S2+ sleep S1+ Execution Timeline fork S1 S2S3S4S5S6 detect sigr4 detect exit resume detect sigr3 detect sigr6 detect sigr2 detect sigr5 resume record sigr4, sleep CPU2 CPU3 CPU4 time record sigr2, sleep resume record sigr5, sleep resume record sigr6, sleep resume original application instrumented application slices CPU1

Hazelwood – CGO Starting Slices How? Master application process – ptrace Controlling process Child slices – fork –Reserve memory for transparency –Each slice has its own code cache (for now) When? Timeouts –Uses a special timer process –Tunable parameter System calls

Hazelwood – CGO S1 Handling System Calls Problem: Don’t want to duplicate system calls in the main application/slices Solutions: brk or anonymous mmap – duplication OK frequent calls – record and playback default – trigger new timeslice S2 S1 Instr A Instr B SysCall Instr C Instr D Instr A Instr B MMAP Instr C Instr D S1 Instr A Instr B SysCallPlay Instr C Instr D DuplicateRecord/PlaybackEnd Slice

Hazelwood – CGO Ending Slices Each slice is responsible for detecting its own end-of-slice condition S4+ record sig4, sleep resume detect sig5 Challenges: Need to efficiently capture a point in time (signature) Need to efficiently detect when we’ve reached that point

Hazelwood – CGO Signature Detection End-of-slice conditions: 1.System calls – easy to detect 2.Timeouts at arbitrary points – harder to detect Signature match: Instruction pointer Architectural state Top of stack

Hazelwood – CGO Implementing Signature Detection Uses Pin’s lightweight conditional analysis INS_InsertIfCall – lightweight inlined check INS_InsertThenCall – heavyweight (conditional) analysis routine Instrument the end-of-slice instruction pointer 1.Lightweight check – two registers 2.Heavyweight check – full architectural state 3.Heavyweight check – top 100 words on the stack Lightweight triggers heavyweight: ~2% Stack check fails: ~0%

Hazelwood – CGO Performance Results Icount1 – Instruments every instruction with count++ % pin –t icount1 --./hello Hello CGO Count: Icount2 – Instruments every basic block with count += bblength % pin –t icount2 --./hello Hello CGO Count:

Hazelwood – CGO Performance – icount1 % pin –t icount1 --

Hazelwood – CGO Performance – icount2 % pin –t icount2 --

Hazelwood – CGO Performance Scalability Running on an 8-processor HT-enabled machine (16 virtual processors)

Hazelwood – CGO Overhead Categorization Where is SuperPin spending its time? Executing the application Fork overheads Sleeping (waiting to start a slice) Pipeline delays fork S1 S2S3S4 record sigr2, sleep detect sigr3 resume record sigr4, sleep detect exit resume sleep detect sigr2 resume detect sigr4 record sigr3, sleep resume CPU3 S2+S4+ S1+ S3+

Hazelwood – CGO Overhead Categorization

Hazelwood – CGO The SuperPin API You may write SuperPin-aware Pintools: –SP_Init(fun) –SP_AddSliceBeginFunction(fun,val) –SP_AddSliceEndFunction(fun,val) –SP_EndSlice() –SP_CreateSharedArea(local,size,merge) You may also control (via switches): –Spmsec {value}: milliseconds per timeslice –Spmp {value}: maximum slice count –Spsysrecs {value}: maximum syscalls per slice

Hazelwood – CGO Pin Instrumentation API – icount2 VOID DoCount(INT32 c) { icount += c;} VOID Trace(TRACE trace, VOID *v) { for (BBL bbl=Head(trace); Valid(bbl); bbl=Next(bbl)) { INS_InsertCall(BBL_InsHead(bbl), IPOINT_BEFORE, (AFUNPTR)DoCount, IARG_INT32, BBL_NumIns(bbl), IARG_END); } int main(INT32 argc, CHAR **argv) { PIN_Init(argc, argv); TRACE_AddInstrumentFunction(Trace); PIN_StartProgram(); return 0; }

Hazelwood – CGO SuperPin Version of Icount2 VOID DoCount(INT32 c) {//same as before} VOID ToolReset(INT32 c) { icount = 0;} VOID Merge(INT32 sliceNum) { *sharedData += icount;} VOID Trace(TRACE trace, VOID *v) {//same as before} int main(INT32 argc, CHAR **argv) { PIN_Init(argc, argv); SP_Init(ToolReset); sharedData = (UINT64*) SP_CreateSharedArea(&icount, sizeof(icount),0); SP_AddSliceEndFunction(Merge); TRACE_AddInstrumentFunction(Trace); PIN_StartProgram(); return 0; }

Hazelwood – CGO SuperPin Limitations Not all instrumentation tasks are a good fit Great fit – independent tasks Instruction profiling (counts, distributions) Trace generation Requires massaging – dependent tasks Branch prediction Data cache simulation –Assume a starting state, resolve later Stick with regular Pin – path modification Adaptive execution

Hazelwood – CGO Future (Rainy Day) Extensions Adaptive parallelism detection –Hardware feedback: adapts to available processors –OS feedback: adapts to present load Adaptive slice timeouts Slice-shared code caches Multithreaded application support

Hazelwood – CGO SuperPin Summary Allows users to leverage available parallelism for certain instrumentation tasks Hides the gory details Enables significant speedup (for the right tasks … on the right machines) Exposed as Pin API extensions  Download it today!

Hazelwood – CGO Thank You

Hazelwood – CGO Merging Results Each slice is a separate process  must merge slice-local results into a collective total Merge routines: Addition Concatenation User-defined Executed in program order Use shared memory