Shadow Profiling: Hiding Instrumentation Costs with Parallelism
Tipp Moseley, Alex Shye, Vijay Janapa Reddi, Dirk Grunwald (University of Colorado), Ramesh Peri (Intel Corporation)

Motivation
An ideal profiler will…
1. Collect arbitrarily detailed and abundant information
2. Incur negligible overhead
A real profiler, e.g. using Pin, satisfies condition 1, but the cost is high:
- 3X for basic-block (BBL) counting
- 25X for loop profiling
- 50X or higher for memory profiling
A real profiler, e.g. PMU sampling or code patching, satisfies condition 2, but the detail is very coarse

Motivation
[Design-space diagram: overhead vs. detail]
Low overhead, coarse detail: VTune, DCPI, OProfile, PAPI, pfmon, PinProbes, …
High detail, high overhead: Pintools, Valgrind, ATOM, …
Low overhead and high detail: "Bursty Tracing" (sampled instrumentation), novel hardware, Shadow Profiling

Goal
To create a profiler capable of collecting detailed, abundant information while incurring negligible overhead
Enable developers to focus on other things

The Big Idea
Stems from fault-tolerance work on deterministic replication
Periodically fork(), profile the "shadow" processes

Time | CPU 0         | CPU 1   | CPU 2   | CPU 3
  0  | Orig. Slice 0 | Slice 0 |         |
  1  | Orig. Slice 1 | Slice 0 | Slice 1 |
  2  | Orig. Slice 2 | Slice 0 | Slice 1 | Slice 2
  3  | Orig. Slice 3 | Slice 3 | Slice 1 | Slice 2
  4  | Orig. Slice 4 | Slice 3 | Slice 4 | Slice 2
  5  |               | Slice 3 | Slice 4 |
  6  |               |         | Slice 4 |
* Assuming instrumentation overhead of 3X

Challenges
- Threads
- Shared memory
- Asynchronous interrupts
- System calls
- JIT overhead
- Overhead vs. number of CPUs: maximum speedup is the number of CPUs; if profiler overhead is 50X, at least 51 CPUs are needed to run in real time (probably many more)
Too many complications to ensure deterministic replication

Goal (Revised)
To create a profiler capable of sampling detailed traces (bursts) with negligible overhead
Trade abundance for low overhead
Like SimPoints or SMARTS (but not as smart :)

The Big Idea (revised)

Time | CPU 0         | CPU 1   | CPU 2   | CPU 3
  0  | Orig. Slice 0 | Slice 0 |         | Spyware
  1  | Orig. Slice 1 | Slice 0 |         | Spyware
  2  | Orig. Slice 2 | Slice 0 | Slice 1 | Spyware
  3  | Orig. Slice 3 |         | Slice 1 | Spyware
  4  | Orig. Slice 4 |         | Slice 1 | Spyware

Do not strive for a full, deterministic replica
Instead, profile many short, mostly deterministic bursts
- Profile a fixed number of instructions
- "Fake it" for system calls
- Must not allow the shadow to side-effect the system

Design Overview

Monitor uses Pin Probes (code patching); the application runs natively
Monitor receives a periodic timer signal and decides when to fork()
After fork(), the child uses PIN_ExecuteAt() to switch Pin from Probe mode to JIT mode
The shadow process profiles as usual, except for the handling of special cases
The monitor logs special read() system calls and pipes the results to the shadow processes
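A minimal sketch of the timer-driven fork, in plain C. This is not the paper's Pin-based implementation: shadow_profile() is a hypothetical stand-in for the child switching Pin from Probe to JIT mode via PIN_ExecuteAt(), and the sampling period is a placeholder.

```c
#include <signal.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical hook: in the real system, this is where the child asks Pin
 * to leave Probe mode and start JIT-mode instrumentation. */
extern void shadow_profile(void);

static void on_timer(int sig) {
    (void)sig;
    pid_t pid = fork();          /* snapshot the application via copy-on-write */
    if (pid == 0) {
        shadow_profile();        /* child: profile one burst, then exit */
        _exit(0);                /* the shadow must never side-effect the system */
    }
    /* parent: keep running natively; optionally track pid to bound the load */
}

void install_shadow_timer(void) {
    struct sigaction sa = {0};
    sa.sa_handler = on_timer;
    sigaction(SIGALRM, &sa, NULL);

    struct itimerval itv = {0};
    itv.it_interval.tv_sec = 1;  /* sampling period: placeholder value */
    itv.it_value.tv_sec = 1;
    setitimer(ITIMER_REAL, &itv, NULL);
}
```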

System Calls
For SPEC CPU2000, system calls occur around 35 times per second
Forking after each one puts heavy pressure on copy-on-write pages and the Pin JIT engine
95% of dynamic system calls can be safely handled
Some system calls can simply be allowed to execute (49%):
getrusage, _llseek, times, time, brk, munmap, fstat64, close, stat64, umask, getcwd, uname, access, exit_group, …

System Calls
Some can be replaced, with success assumed (39%):
write, ftruncate, writev, unlink, rename, …
Some are handled specially, but execution may continue (1.8%):
mmap2, open(creat), mmap, mprotect, mremap, fcntl
read() is special (5.4%):
- For reads from pipes/sockets, the data must be logged from the original application
- For reads from files, the file must be closed and reopened after the fork() because the OS file offset is not duplicated (it is shared with the original process)
ioctl() is special (4.8%):
- Frequent in perlbmk
- Behavior is device-dependent, so the safest action is to simply terminate the segment and re-fork()
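A hedged sketch of what the shadow's system-call filter might look like. The categories and example calls are from the slides; the action names, the dispatch function, and switching on Linux syscall numbers are illustrative, not the paper's actual Pin callback.

```c
#include <sys/syscall.h>

/* Possible dispositions for a system call reached inside a shadow burst. */
typedef enum {
    SC_EXECUTE,       /* harmless: let it run (getrusage, brk, close, ...) */
    SC_FAKE_SUCCESS,  /* skip it and pretend it succeeded (write, unlink, ...) */
    SC_SPECIAL,       /* needs custom handling (mmap, open, read, ...) */
    SC_END_SEGMENT    /* give up on this burst and let the monitor re-fork() */
} sc_action_t;

sc_action_t classify_syscall(long nr) {
    switch (nr) {
    case SYS_getrusage: case SYS_times:  case SYS_time:
    case SYS_brk:       case SYS_munmap: case SYS_close:
        return SC_EXECUTE;        /* ~49% of dynamic system calls */
    case SYS_write: case SYS_ftruncate: case SYS_writev:
    case SYS_unlink: case SYS_rename:
        return SC_FAKE_SUCCESS;   /* ~39%: must not side-effect the system */
    case SYS_mmap: case SYS_mprotect: case SYS_mremap:
    case SYS_fcntl: case SYS_open: case SYS_read:
        return SC_SPECIAL;        /* e.g. read() replays logged data or reopens the file */
    case SYS_ioctl:
        return SC_END_SEGMENT;    /* device-dependent behavior: terminate the burst */
    default:
        return SC_END_SEGMENT;    /* unknown calls are treated conservatively */
    }
}
```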

Other Issues
Shared Memory
- Disallow writes to shared memory
Asynchronous Interrupts (userspace signals)
- Since we are only mostly deterministic, no longer an issue
- When the main program receives a signal, pass it along to live children
JIT Overhead
- After each fork(), it is like Pinning a new program; warmup is too slow
- Use Persistent Code Caching [CGO'07]
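Forwarding signals to live shadow children could look roughly like this; a minimal sketch assuming the monitor keeps a list of shadow PIDs (the bookkeeping array here is hypothetical).

```c
#include <signal.h>
#include <sys/types.h>

#define MAX_SHADOWS 8

/* Hypothetical bookkeeping: PIDs of shadow processes still running a burst. */
static pid_t live_shadows[MAX_SHADOWS];
static int   num_shadows;

/* Installed for the signals the application cares about: when the original
 * process receives one, re-deliver it to every live shadow so their bursts
 * observe (approximately) the same asynchronous events. */
static void forward_signal(int sig) {
    for (int i = 0; i < num_shadows; i++)
        kill(live_shadows[i], sig);
}
```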

Multithreaded Programs
Issue: fork() does not duplicate all threads, only the thread that called fork()
Solution (see the sketch below):
1. Barrier all threads in the program and store their CPU state
2. Fork the process and clone new threads for those that were destroyed
   (identical address space; only register state was really 'lost')
3. In each new thread, restore the previous CPU state
   (modified clone() handling in the Pin VM)
4. Continue execution, virtualizing thread IDs for the relevant system calls
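A rough, self-contained C sketch of steps 1-3 under simplifying assumptions. The real mechanism lives inside Pin's VM and its clone() handling; the helper names, the fixed thread count, and the getcontext()/setcontext() trick used here are illustrative only.

```c
#include <pthread.h>
#include <ucontext.h>
#include <unistd.h>

#define NTHREADS 4                     /* assumed fixed for this sketch */

/* barrier is assumed initialized elsewhere:
 * pthread_barrier_init(&barrier, NULL, NTHREADS); */
static pthread_barrier_t barrier;
static ucontext_t saved[NTHREADS];     /* per-thread register state (step 1) */
static volatile int in_shadow = 0;     /* set only in the forked shadow process */

static void *resume_thread(void *arg) {
    /* Step 3: jump to the state captured before the fork.  The stack it
     * refers to still exists in the shadow's copy of the address space. */
    setcontext(&saved[(long)arg]);
    return NULL;                       /* never reached */
}

/* Called by every application thread when a shadow fork is requested. */
void shadow_rendezvous(long tid) {
    getcontext(&saved[tid]);           /* step 1: capture this thread's state */
    if (in_shadow)
        return;                        /* resumed inside the shadow: keep running */

    pthread_barrier_wait(&barrier);    /* step 1: quiesce all threads */

    if (tid == 0 && fork() == 0) {     /* step 2: only thread 0 survives fork() */
        in_shadow = 1;
        for (long t = 1; t < NTHREADS; t++) {
            pthread_t th;
            pthread_create(&th, NULL, resume_thread, (void *)t);
        }
        /* thread 0 of the shadow falls through and starts profiling */
    }
    /* original process: all threads continue natively */
}
```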

Tuning Overhead
Load: number of active shadow processes
- Tested 0.125, 0.25, 0.5, 1.0, 2.0
Sample Size: number of instructions to profile
- Longer samples for less overhead, more data
- Shorter samples for more evenly dispersed data
- Tested 1M, 10M, 100M

Experiments
Value Profiling
- Typical overhead ~100X
- Accuracy measured by difference in invariance
Path Profiling
- Typical overhead 50% - 10X
- Accuracy measured by percent of hot paths detected (2% threshold)
All experiments use the SPEC CPU2000 INT benchmarks with the "ref" data set
Arithmetic mean of 3 runs presented
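For readers unfamiliar with the value-profiling accuracy metric, here is a rough sketch of how a "difference in invariance" comparison might be computed. The slides do not give the exact definition, so the per-location invariance formula and the averaging below are assumptions based on the usual value-profiling literature.

```c
#include <math.h>
#include <stddef.h>

/* Invariance of one profiled location: the fraction of its dynamic
 * occurrences taken by the single most frequent value (one common
 * definition; assumed here). */
double invariance(const unsigned long *value_counts, size_t nvalues) {
    unsigned long total = 0, max = 0;
    for (size_t i = 0; i < nvalues; i++) {
        total += value_counts[i];
        if (value_counts[i] > max)
            max = value_counts[i];
    }
    return total ? (double)max / (double)total : 0.0;
}

/* Accuracy sketch: mean absolute difference in invariance between the
 * full (reference) profile and the sampled shadow profile, taken over
 * all profiled locations. */
double mean_invariance_difference(const double *full, const double *shadow, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += fabs(full[i] - shadow[i]);
    return n ? sum / (double)n : 0.0;
}
```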

Results - Value Profiling Overhead
Overhead versus native execution
Several configurations less than 1%
Path profiling exhibits similar trends

Results - Value Profiling Accuracy
All configurations within 7% of a perfect profile
Lower is better

Results - Path Profiling Accuracy
Most configurations over 90% accurate
Higher is better
Some benchmarks (e.g., 176.gcc, 186.crafty, 197.parser) have millions of paths, but few are "hot"

Results - Page Fault Increase
Proportional increase in page faults (shadow/native)

Results - Page Fault Rate Difference in page faults per second experienced by native application

Future Work
Improve stability for multithreaded programs
Investigate effects of different persistent code cache policies
Compare sampling policies
- Random (current)
- Phase/event-based
- Static analysis
Study convergence
Apply the technique to
- Profile-guided optimizations
- Simulation techniques

Conclusion
Shadow Profiling allows collection of bursts of detailed traces
- Accuracy is over 90%
Incurs negligible overhead
- Often less than 1%
With increasing numbers of cores, allows developers' focus to shift from profiling to applying optimizations