PiPA: Pipelined Profiling and Analysis on Multi-core Systems
Qin Zhao, Ioana Cutcutache, Weng-Fai Wong
CGO
Why PiPA?
Code profiling and analysis
–very useful for understanding program behavior
–implemented using dynamic instrumentation systems
–several challenges: coverage, accuracy, overhead
  –overhead due to the instrumentation engine
  –overhead due to the profiling code
The performance problem!
–Cachegrind: 100x slowdown
–Pin dcache: 32x slowdown
Need faster tools!
Our Goals
Improve the performance
–reduce the overall profiling and analysis overhead
–but maintain the accuracy
How?
–parallelize!
–optimize
Keep it simple
–easy to understand
–easy to build new analysis tools
Previous Approach
Parallelized slice profiling
–SuperPin, Shadow Profiling
–suitable for simple, independent tasks
[Figure: timeline comparing the original application, the instrumented application, and the SuperPinned application — instrumentation and profiling overhead are hidden by running instrumented slices in parallel with the uninstrumented application]
PiPA Key Idea: Pipelining!
–stage 0: instrumented application
–stage 1: profile processing
–stage 2: parallel analysis
Stages run as threads or processes, passing profile information down the pipeline.
[Figure: timeline — while the instrumented application runs, stage 1 processes the profile and stage 2 runs analyses on profiles 1–4 in parallel]
PiPA Challenges
Minimize the profiling overhead
–Runtime Execution Profile (REP)
Minimize the communication between stages
–double buffering
Design efficient parallel analysis algorithms
–we focus on cache simulation
PiPA Prototype: Cache Simulation
Our Prototype
Implemented in DynamoRIO
Three stages
–stage 0: instrumented application – collects the REP
–stage 1: parallel profile recovery and splitting
–stage 2: parallel cache simulation
Experiments
–SPEC2000 & SPEC2006 benchmarks
–3 systems: dual-core, quad-core, and eight-core
Communication
Keys to minimizing the overhead
–double buffering
–shared buffers
–large buffers
Example: communication between stage 0 and stage 1
[Figure: the profiling thread at stage 0 fills shared buffers that are drained by the processing threads at stage 1]
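The buffer hand-off between stage 0 and stage 1 can be sketched as follows (a minimal Python model of double buffering, not the DynamoRIO implementation; buffer size and names are illustrative — the prototype uses shared memory buffers of up to 16MB):

```python
import threading
from queue import Queue

BUF_SIZE = 4  # tiny for illustration; the prototype uses large buffers

# Two shared buffers: stage 0 fills one while stage 1 drains the other.
empty_bufs = Queue()   # buffers ready to be filled by stage 0
full_bufs = Queue()    # buffers ready to be processed by stage 1
for _ in range(2):
    empty_bufs.put([])

def profiler(events):
    """Stage 0: append profile records, handing off a buffer when it fills."""
    buf = empty_bufs.get()
    for ev in events:
        buf.append(ev)
        if len(buf) == BUF_SIZE:
            full_bufs.put(buf)          # hand the full buffer to stage 1
            buf = empty_bufs.get()      # grab the other (recycled) buffer
    full_bufs.put(buf)                  # flush the final partial buffer
    full_bufs.put(None)                 # end-of-trace marker

processed = []
def recovery():
    """Stage 1: drain full buffers and recycle them for stage 0."""
    while (buf := full_bufs.get()) is not None:
        processed.extend(buf)
        buf.clear()
        empty_bufs.put(buf)

t = threading.Thread(target=recovery)
t.start()
profiler(list(range(10)))
t.join()
```

Because the two stages block only on buffer exchange, the application thread keeps running while recovery proceeds on another core.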
Stage 0: Profiling
–compact profile
–minimal overhead
Stage 0: Profiling
Runtime Execution Profile (REP)
–fast profiling
–small profile size
–easy information extraction
Hierarchical structure
–profile buffers
–data units
–slots
Can be customized for different analyses
–in our prototype we consider cache simulation
REP Example
[Figure: REP layout for two basic blocks. bb1 (mov [eax + 0x0c] eax; mov ebp esp; pop ebp; return) is described by a static REP unit (REP-S) with tag 0x080483d7, num_slots 2, num_refs 3; each ref records pc, type, size, offset, value_slot, and size_slot — e.g. ref0: pc 0x080483d7, read, size 4, offset 12, value_slot 1, size_slot -1; ref1: pc 0x080483dc, read, size 4, offset 0, value_slot 2; ref2: pc 0x080483dd, read, size 4, offset 4, value_slot 2. The dynamic profile (REP-D) records only the register slot values (eax and esp for bb1; esp for bb2: pop ebx; pop ecx; cmp eax, 0; jz label_bb3) — 12 bytes per execution of bb1. Profile buffers are chained (first buffer → next buffer), tracked by a base pointer, and terminated by a canary zone.]
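The split between the static and dynamic parts of the REP can be modeled roughly as follows (a hypothetical Python reconstruction — field names follow the slide, but the classes and exact encoding are ours, not the prototype's):

```python
from dataclasses import dataclass

@dataclass
class Ref:
    """Static description of one memory reference (recorded once per bb)."""
    pc: int
    type: str         # 'read' or 'write'
    size: int         # access size in bytes
    offset: int       # constant displacement from the base register
    value_slot: int   # which dynamic slot holds the base register value
    size_slot: int    # -1 when the size is static

@dataclass
class RepUnit:
    """REP-S: one unit per basic block, shared by all its executions."""
    tag: int          # basic block address
    num_slots: int
    refs: list

# bb1 from the slide: 3 references described by only 2 slots (eax, esp)
bb1 = RepUnit(tag=0x080483D7, num_slots=2, refs=[
    Ref(0x080483D7, 'read', 4, 12, 1, -1),  # mov [eax + 0x0c]
    Ref(0x080483DC, 'read', 4, 0,  2, -1),  # pop ebp
    Ref(0x080483DD, 'read', 4, 4,  2, -1),  # return
])

# REP-D for one execution of bb1: only the two register values are logged,
# instead of three full (pc, address, type, size) records.
rep_d = [0x2304, 0x141A]   # slot 1 = eax, slot 2 = esp (example values)
```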
Profiling Optimizations
Store register values in the REP
–avoid computing the memory address
Register liveness analysis
–avoid register stealing if possible
Record a single register value for multiple references
–a single stack pointer value for a sequence of push/pop
–the base address for multiple accesses to the same structure
More in the paper
Profiling Overhead
[Charts: slowdown relative to native execution on SPECint2000 and SPECfp2000, for the 2-core, 4-core, and 8-core systems, with optimized instrumentation vs. instrumentation without optimization]
Avg slowdown: ~3x
Stage 1: Profile Recovery
–fast recovery
Stage 1: Profile Recovery
Need to reconstruct the full memory reference information
[Figure: the REP unit for bb1 (tag 0x080483d7, 2 slots, 3 refs) plus the recorded slot values (0x2304, 0x141a) expand into the full trace:
PC          Type  Size  Address
0x080483d7  read  4     0x2310   (= 0x2304 + offset 12)
0x080483dc  read  4     0x141a]
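A minimal sketch of the recovery step, assuming the REP fields from the earlier example (the `recover` helper and tuple layout are illustrative, not the prototype's code): each full address is simply the recorded base-register value plus the static displacement.

```python
# Static REP records for bb1: (pc, type, size, offset, value_slot)
BB1_REFS = [
    (0x080483D7, 'read', 4, 12, 1),   # mov [eax + 0x0c]
    (0x080483DC, 'read', 4, 0,  2),   # pop ebp
]

def recover(refs, slots):
    """Rebuild (pc, address, type, size) records from static refs plus the
    dynamic slot values; slots[i-1] holds the value for slot i."""
    trace = []
    for pc, rtype, size, offset, value_slot in refs:
        addr = slots[value_slot - 1] + offset   # base register + displacement
        trace.append((pc, addr, rtype, size))
    return trace

# Dynamic slots for one execution: slot 1 = eax = 0x2304, slot 2 = esp = 0x141a
trace = recover(BB1_REFS, [0x2304, 0x141A])
# trace[0] is (0x080483D7, 0x2310, 'read', 4) — eax + 12
```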
Profile Recovery Overhead
Factor 1: buffer size
–small (64KB), medium (1MB), large (16MB)
Experiments done on the 8-core system, using 8 recovery threads
[Chart: slowdown relative to native execution on SPECint2000 and SPECfp2000 for the three buffer sizes]
Profile Recovery Overhead
Factor 2: the number of recovery threads
Experiments done on the 8-core system, using 16MB buffers
[Chart: slowdown relative to native execution on SPECint2000 and SPECfp2000 with 1, 2, 4, 6, and 8 recovery threads]
Profile Recovery Overhead
Factor 3: the number of available cores
Experiments done using 16MB buffers and 8 recovery threads
[Chart: slowdown relative to profiling on SPECint2000 and SPECfp2000 with 2, 4, and 8 cores]
Profile Recovery Overhead
Factor 4: the impact of using REP
–experiments done on the 8-core system with 16MB buffers and 8 threads
–PiPA using the standard profile format: 20.7x slowdown
–PiPA using REP: 4.5x slowdown
Stage 2: Cache Simulation
–parallel analysis
–independent simulators
Stage 2: Parallel Cache Simulation
How to parallelize?
–split the address trace into independent groups
–two memory references that access different sets are independent
Set-associative caches
–partition the cache sets and simulate them using several independent simulators
–merge the results (number of hits and misses) at the end of the simulation
Example
–32K cache, 32-byte line, 4-way associative => 256 sets
–4 independent simulators, each one simulates 64 sets (round-robin distribution)
[Figure: trace references (0xbf9c4614, 0xbf9c4705, 0xbf9c4a34, 0xbf9c4a60, 0xbf9c4a5c, 0xbf9c460d, ...) are distributed by set index to the four simulators]
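The set-partitioning scheme above can be sketched as follows (a toy LRU model with illustrative names, not the paper's simulator): each worker owns every 4th set, and the per-worker hit/miss counters are summed at the end.

```python
LINE, WAYS, SETS, WORKERS = 32, 4, 256, 4   # 32KB, 4-way, 32-byte lines

class SetSim:
    """Simulates only the cache sets assigned to one worker."""
    def __init__(self):
        self.sets = {}            # set index -> list of tags, MRU first
        self.hits = self.misses = 0
    def access(self, addr):
        idx = (addr // LINE) % SETS
        tag = addr // (LINE * SETS)
        lines = self.sets.setdefault(idx, [])
        if tag in lines:
            self.hits += 1
            lines.remove(tag)
        else:
            self.misses += 1
            if len(lines) == WAYS:
                lines.pop()       # evict the LRU line (tail)
        lines.insert(0, tag)      # most recently used at head

sims = [SetSim() for _ in range(WORKERS)]

def simulate(trace):
    for addr in trace:
        idx = (addr // LINE) % SETS
        sims[idx % WORKERS].access(addr)   # round-robin set distribution
    # merge the per-worker results, as in stage 2
    return (sum(s.hits for s in sims), sum(s.misses for s in sims))

hits, misses = simulate([0xBF9C4614, 0xBF9C4705, 0xBF9C4A34, 0xBF9C4614])
```

Because a reference only ever touches one set, the workers never share state and can run on separate cores with no synchronization beyond the final merge.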
Cache Simulation Overhead
Experiments done on the 8-core system
–8 recovery threads and 8 cache simulators
PiPA: 10.5x slowdown vs. Pin dcache: 32x
PiPA speedup over dcache: 3x
SPEC2006 Results
Experiments done using the 8-core system
[Chart: slowdown for profiling only, profiling + recovery, and full cache simulation (values shown: 3.27x, 3.7x, 10.2x)]
Average speedup over dcache: 3x
Summary
PiPA is an effective technique for parallel profiling and analysis
–based on pipelining
–drastically reduces both profiling time and analysis time
–full cache simulation incurs only a 10.5x slowdown
Runtime Execution Profile
–requires minimal instrumentation code
–compact enough to ensure optimal buffer usage
–makes it easy for the next stages to recover the full trace
Parallel cache simulation
–the cache is partitioned across several independent simulators
Future Work
Design APIs
–hide the communication between the pipeline stages
–let tool writers focus only on the instrumentation and analysis tasks
Further improve the efficiency
–parallel profiling
–workload monitoring
More analysis algorithms
–branch prediction simulation
–memory dependence analysis
–...
Pin Prototype
Second implementation, in Pin
–preliminary results: 2.6x speedup over Pin dcache
We plan to release PiPA.