Shadow Profiling: Hiding Instrumentation Costs with Parallelism Tipp Moseley Alex Shye Vijay Janapa Reddi Dirk Grunwald (University of Colorado) Ramesh.

1 Shadow Profiling: Hiding Instrumentation Costs with Parallelism
  Tipp Moseley, Alex Shye, Vijay Janapa Reddi, Dirk Grunwald (University of Colorado)
  Ramesh Peri (Intel Corporation)

2 Motivation
  An ideal profiler will:
    1. Collect arbitrarily detailed and abundant information
    2. Incur negligible overhead
  A real profiler, e.g., one built on Pin, satisfies condition 1, but the cost is high:
    3X for basic block (BBL) counting
    25X for loop profiling
    50X or higher for memory profiling
  A real profiler, e.g., one using PMU sampling or code patching, satisfies condition 2, but the detail is very coarse

3 Motivation
  Low overhead: VTune, DCPI, OProfile, PAPI, pfmon, PinProbes, …
  High detail: Pintools, Valgrind, ATOM, …
  Both: “Bursty Tracing” (Sampled Instrumentation), Novel Hardware, Shadow Profiling

4 Goal
  To create a profiler capable of collecting detailed, abundant information while incurring negligible overhead
  Enable developers to focus on other things

5 The Big Idea
  Stems from fault tolerance work on deterministic replication
  Periodically fork(), profile “shadow” processes

  Time  CPU 0          CPU 1    CPU 2    CPU 3
  0     Orig. Slice 0  Slice 0
  1     Orig. Slice 1  Slice 0  Slice 1
  2     Orig. Slice 2  Slice 0  Slice 1  Slice 2
  3     Orig. Slice 3  Slice 3  Slice 1  Slice 2
  4     Orig. Slice 4  Slice 3  Slice 4  Slice 2
  5                    Slice 3  Slice 4
  6

  * Assuming instrumentation overhead of 3X

6 Challenges
  Threads
  Shared Memory
  Asynchronous Interrupts
  System Calls
  JIT overhead
  Overhead vs. Number of CPUs
    Maximum speedup is Number of CPUs
    If profiler overhead is 50X, need at least 51 CPUs to run in real time (probably many more)
  Too many complications to ensure deterministic replication
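
The CPU arithmetic on this slide can be written out as a small sketch. It assumes the simplest possible model: one CPU runs the application natively, every slice takes `overhead` times longer under instrumentation, and nothing else (fork cost, JIT warmup, memory pressure) is counted. The function name is hypothetical and the result is only a lower bound, which is why the slide adds "probably many more".

```python
import math


def cpus_for_real_time(overhead: float) -> int:
    """Lower bound on CPUs needed for shadows to keep pace with the original.

    Assumption (not from the slides): an idealized model where one CPU runs
    the native application and ceil(overhead) shadows must be in flight at
    once, each on its own CPU.
    """
    if overhead < 1:
        raise ValueError("overhead is a slowdown factor and must be >= 1")
    return 1 + math.ceil(overhead)


if __name__ == "__main__":
    for factor in (3, 25, 50):
        print(f"{factor}X overhead -> at least {cpus_for_real_time(factor)} CPUs")
```

With 3X overhead this gives the four CPUs used in the timeline on the previous slide, and with 50X it gives the 51 CPUs quoted here.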

7 Goal (Revised)
  To create a profiler capable of sampling detailed traces (bursts) with negligible overhead
  Trade abundance for low overhead
  Like SimPoints or SMARTS (but not as smart :)

8 The Big Idea (revised)

  Time  CPU 0          CPU 1    CPU 2    CPU 3
  0     Orig. Slice 0  Slice 0           Spyware
  1     Orig. Slice 1  Slice 0           Spyware
  2     Orig. Slice 2  Slice 0  Slice 1  Spyware
  3     Orig. Slice 3           Slice 1  Spyware
  4     Orig. Slice 4           Slice 1  Spyware

  Do not strive for a full, deterministic replica
  Instead, profile many short, mostly deterministic bursts
    Profile a fixed number of instructions
    “Fake it” for system calls
    Must not allow the shadow to cause side effects on the system

9 Design Overview

10
  Monitor uses Pin Probes (code patching)
  Application runs natively
  Monitor receives periodic timer signal and decides when to fork()
  After fork(), child uses PIN_ExecuteAt() functionality to switch Pin from Probe to JIT mode
  Shadow process profiles as usual, except for the handling of special cases
  Monitor logs special read() system calls and pipes the results to shadow processes
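
The fork-and-profile step can be illustrated with a minimal POSIX-only sketch. It is not the Pin-based monitor described above: there is no timer signal, no Probe-to-JIT switch and no system call handling. It only shows that a forked child works on a copy-on-write snapshot, so instrumentation state it modifies never reaches the original process. The names `run_in_shadow`, `collect`, `profile_burst` and `counters` are hypothetical.

```python
import os
import pickle

# Stand-in for profiling state; the real system would keep this inside Pin.
counters = {"blocks": 0}


def profile_burst():
    """Pretend to profile a fixed-length burst by counting 1000 'blocks'."""
    for _ in range(1000):
        counters["blocks"] += 1
    return counters["blocks"]


def run_in_shadow(profile_fn):
    """Fork a shadow process, run profile_fn in it, return (pid, read_fd)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Shadow process: sees a copy-on-write snapshot of the parent.
        os.close(read_fd)
        try:
            payload = pickle.dumps(profile_fn())
            with os.fdopen(write_fd, "wb") as out:
                out.write(payload)
        finally:
            os._exit(0)  # never fall back into the parent's code path
    # Original process: keeps running, reads the result later.
    os.close(write_fd)
    return pid, read_fd


def collect(pid, read_fd):
    """Read the shadow's result from the pipe and reap the child."""
    with os.fdopen(read_fd, "rb") as src:
        data = src.read()
    os.waitpid(pid, 0)
    return pickle.loads(data)


if __name__ == "__main__":
    pid, fd = run_in_shadow(profile_burst)
    print("shadow counted:", collect(pid, fd))
    print("original still sees:", counters["blocks"])
```

The pipe here carries results from the shadow back to the original process; in the design above a pipe is also used in the other direction, to feed logged read() data to the shadows.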

11 System Calls
  For SPEC CPU2000, system calls occur around 35 times per second
  Forking after each one puts heavy pressure on copy-on-write (CoW) pages and the Pin JIT engine
  95% of dynamic system calls can be safely handled
  Some system calls can be allowed to execute (49%)
    getrusage, _llseek, times, time, brk, munmap, fstat64, close, stat64, umask, getcwd, uname, access, exit_group, …

12 System Calls
  Some can be replaced with success assumed (39%)
    write, ftruncate, writev, unlink, rename, …
  Some are handled specially, but execution may continue (1.8%)
    mmap2, open(creat), mmap, mprotect, mremap, fcntl
  read() is special (5.4%)
    For reads from pipes/sockets, the data must be logged from the original app
    For reads from files, the file must be closed and reopened after the fork() because the OS file offset is shared with the parent rather than duplicated
  ioctl() is special (4.8%)
    Frequent in perlbmk
    Behavior is device-dependent; the safest action is to simply terminate the segment and re-fork()
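
The classification on these two slides amounts to a lookup table from system call name to shadow-side action. The sketch below encodes only the names listed on the slides. The `Action` enum and `shadow_action` function are hypothetical, and two points are assumptions rather than anything the slides state: unlisted calls are treated like ioctl() (terminate and re-fork), and any open() is treated as special even though the slides mention only open with creat.

```python
from enum import Enum


class Action(Enum):
    EXECUTE = "let the call run in the shadow"
    ASSUME_SUCCESS = "skip the call and report success"
    SPECIAL = "handle specially, then continue"
    REPLAY_READ = "serve from logged data or a reopened file"
    TERMINATE = "end the burst and re-fork"


# Names copied from the slides; the lists there end with an ellipsis,
# so these groups are incomplete by construction.
GROUPS = {
    Action.EXECUTE: (
        "getrusage", "_llseek", "times", "time", "brk", "munmap", "fstat64",
        "close", "stat64", "umask", "getcwd", "uname", "access", "exit_group",
    ),
    Action.ASSUME_SUCCESS: ("write", "ftruncate", "writev", "unlink", "rename"),
    # Assumption: every open() is special, not only open with creat.
    Action.SPECIAL: ("mmap2", "open", "mmap", "mprotect", "mremap", "fcntl"),
    Action.REPLAY_READ: ("read",),
    Action.TERMINATE: ("ioctl",),
}

POLICY = {name: action for action, names in GROUPS.items() for name in names}


def shadow_action(syscall_name: str) -> Action:
    """Return what a shadow process should do when it reaches this call."""
    # Assumption: anything not named on the slides ends the burst.
    return POLICY.get(syscall_name, Action.TERMINATE)


if __name__ == "__main__":
    for name in ("brk", "write", "mmap2", "read", "ioctl", "socket"):
        print(f"{name:8s} -> {shadow_action(name).value}")
```

Defaulting to termination keeps the shadow from causing side effects when it meets a call the table does not cover, at the cost of a shorter burst.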

13 Other Issues
  Shared Memory
    Disallow writes to shared memory
  Asynchronous Interrupts (Userspace signals)
    Since we are only mostly deterministic, no longer an issue
    When main program receives a signal, pass it along to live children
  JIT Overhead
    After each fork(), it is like Pinning a new program
    Warmup is too slow
    Use Persistent Code Caching [CGO’07]

14 Multithreaded Programs
  Issue: fork() does not duplicate all threads
    Only the thread that called fork() exists in the child
  Solution:
    1. Barrier all threads in the program and store their CPU state
    2. Fork the process and clone new threads for those that were destroyed
       Identical address space; only register state was really ‘lost’
    3. In each new thread, restore previous CPU state
       Modified clone() handling in Pin VM
    4. Continue execution, virtualize thread IDs for relevant system calls

15 Tuning Overhead
  Load
    Number of active shadow processes
    Tested 0.125, 0.25, 0.5, 1.0, 2.0
  Sample Size
    Number of instructions to profile
    Longer samples for less overhead, more data
    Shorter samples for more evenly dispersed data
    Tested 1M, 10M, 100M
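
A back-of-envelope sketch of how the two knobs interact, which is not taken from the slides. It assumes that load is the average number of shadows running at any moment, that each shadow executes at 1/overhead of native speed, and that bursts do not overlap. Under those assumptions the share of the original instruction stream that gets profiled is roughly load / overhead. Both function names are hypothetical.

```python
def profiled_fraction(load: float, overhead: float) -> float:
    """Approximate share of native instructions covered by shadow bursts.

    Assumption: `load` shadows run on average, each at 1/overhead of
    native speed, and their bursts never overlap.
    """
    if load < 0 or overhead < 1:
        raise ValueError("need load >= 0 and overhead >= 1")
    return min(1.0, load / overhead)


def expected_bursts(total_instructions: int, sample_size: int,
                    load: float, overhead: float) -> int:
    """Approximate number of bursts of `sample_size` instructions collected."""
    covered = total_instructions * profiled_fraction(load, overhead)
    return round(covered / sample_size)


if __name__ == "__main__":
    # Illustrative inputs only: 100X is the value profiling overhead from
    # the slides; the run length of 10**11 instructions is made up.
    for load in (0.125, 0.25, 0.5, 1.0, 2.0):
        share = profiled_fraction(load, 100)
        bursts = expected_bursts(10**11, 10**7, load, 100)
        print(f"load {load}: ~{share:.3%} of instructions, ~{bursts} bursts of 10M")
```

The model ignores fork cost and JIT warmup, so real coverage would be lower than these estimates.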

16 Experiments
  Value Profiling
    Typical overhead ~100X
    Accuracy measured by Difference in Invariance
  Path Profiling
    Typical overhead 50% to 10X
    Accuracy measured by percent of hot paths detected (2% threshold)
  All experiments use SPEC2000 INT Benchmarks with “ref” data set
  Arithmetic mean of 3 runs presented
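
One plausible reading of the path profiling metric, sketched under assumptions: a path is "hot" if it accounts for at least 2% of all path executions in a profile, and a hot path from the full profile counts as detected if it is also hot in the sampled profile. The slides do not spell out the definition, so the function names and the exact rule are hypothetical.

```python
def hot_paths(counts: dict, threshold: float = 0.02) -> set:
    """Paths whose share of total executions is at least `threshold`."""
    total = sum(counts.values())
    if total == 0:
        return set()
    return {path for path, n in counts.items() if n / total >= threshold}


def hot_path_accuracy(full: dict, sampled: dict, threshold: float = 0.02) -> float:
    """Fraction of the full profile's hot paths that the sampled profile finds.

    Assumption: "detected" means the path is also hot in the sampled profile.
    """
    reference = hot_paths(full, threshold)
    if not reference:
        return 1.0  # nothing to detect
    detected = reference & hot_paths(sampled, threshold)
    return len(detected) / len(reference)


if __name__ == "__main__":
    # Made-up counts for illustration, not measured data.
    full = {"A": 900, "B": 80, "C": 15, "D": 5}
    sampled = {"A": 95, "C": 4, "D": 1}
    print(f"hot paths detected: {hot_path_accuracy(full, sampled):.0%}")
```

In the made-up example the full profile has two hot paths (A and B) and the sampled profile misses B, so half of the hot paths are detected.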

17 Results - Value Profiling Overhead
  [Chart: overhead versus native execution]
  Several configurations less than 1%
  Path profiling exhibits similar trends

18 Results - Value Profiling Accuracy
  All configurations within 7% of perfect profile
  Lower is better

19 Results - Path Profiling Accuracy
  Most configurations over 90% accurate
  Higher is better
  Some benchmarks (e.g., 176.gcc, 186.crafty, 197.parser) have millions of paths, but few are “hot”

20 Results - Page Fault Increase
  [Chart: proportional increase in page faults, Shadow/Native]

21 Results - Page Fault Rate
  [Chart: difference in page faults per second experienced by native application]

22 Future Work
  Improve stability for multithreaded programs
  Investigate effects of different persistent code cache policies
  Compare sampling policies
    Random (current)
    Phase/event-based
    Static analysis
  Study convergence
  Apply technique
    Profile-guided optimizations
    Simulation techniques

23 Conclusion
  Shadow Profiling allows collection of bursts of detailed traces
    Accuracy is over 90%
  Incurs negligible overhead
    Often less than 1%
  With increasing numbers of cores, allows developers’ focus to shift from profiling to applying optimizations
