
1 PiPA: Pipelined Profiling and Analysis on Multi-core Systems
Qin Zhao, Ioana Cutcutache, Weng-Fai Wong

2 Why PiPA?
Code profiling and analysis
– very useful for understanding program behavior
– implemented using dynamic instrumentation systems
– several challenges: coverage, accuracy, overhead (overhead due to the instrumentation engine and overhead due to the profiling code)
The performance problem!
– Cachegrind: 100x slowdown
– Pin dcache: 32x slowdown
Need faster tools!

3 Our Goals
Improve the performance
– reduce the overall profiling and analysis overhead
– but maintain the accuracy
How?
– parallelize!
– optimize!
Keep it simple
– easy to understand
– easy to build new analysis tools

4 Previous Approach
Parallelized slice profiling
– SuperPin, Shadow Profiling
– suitable for simple, independent tasks
[Figure: execution timelines comparing the original (uninstrumented) application, the fully instrumented application, and the SuperPinned application with its instrumented slices; legend: instrumentation overhead, profiling overhead.]

5 PiPA Key Idea: Pipelining!
– Stage 0: instrumented application
– Stage 1: profile processing
– Stage 2: parallel analysis (analyses on profiles 1, 2, 3, 4 run concurrently)
– stages run as separate threads or processes, passing profile information downstream
[Figure: timeline of the pipelined execution; legend: original application, instrumentation overhead, profiling overhead, profile information.]

6 PiPA Challenges
Minimize the profiling overhead
– Runtime Execution Profile (REP)
Minimize the communication between stages
– double buffering
Design efficient parallel analysis algorithms
– we focus on cache simulation

7 PiPA Prototype: Cache Simulation

8 Our Prototype
Implemented in DynamoRIO
Three stages
– Stage 0: instrumented application, collects the REP
– Stage 1: parallel profile recovery and splitting
– Stage 2: parallel cache simulation
Experiments
– SPEC2000 & SPEC2006 benchmarks
– 3 systems: dual-core, quad-core, eight-core

9 Communication
Keys to minimizing the overhead
– double buffering
– shared buffers
– large buffers
Example: communication between stage 0 and stage 1 (a sketch follows below)
[Figure: the profiling thread at stage 0 fills shared buffers that are handed off to the processing threads at stage 1.]
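
The following is a minimal double-buffering sketch of the stage 0 to stage 1 handoff described above. It is not PiPA's actual code: all names (profile_buffer, buffer_queue, stage1_thread, ...) and the pthread-based structure are assumptions made for illustration.

/* Minimal double-buffering sketch (hypothetical names, not PiPA's API).
 * Stage 0 fills one buffer while stage 1 drains the other; full and
 * empty buffers are exchanged through two small shared queues. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define BUF_ENTRIES 1024
#define NUM_BUFFERS 2        /* double buffering */
#define NUM_ROUNDS  8

typedef struct {
    unsigned long entries[BUF_ENTRIES];
    int count;
} profile_buffer;

typedef struct {
    profile_buffer *slots[NUM_BUFFERS + 1];  /* circular queue, capacity NUM_BUFFERS */
    int head, tail;
    pthread_mutex_t lock;
    pthread_cond_t nonempty;
} buffer_queue;

static void queue_init(buffer_queue *q) {
    q->head = q->tail = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->nonempty, NULL);
}

static void queue_push(buffer_queue *q, profile_buffer *b) {
    pthread_mutex_lock(&q->lock);
    q->slots[q->tail] = b;
    q->tail = (q->tail + 1) % (NUM_BUFFERS + 1);
    pthread_cond_signal(&q->nonempty);
    pthread_mutex_unlock(&q->lock);
}

static profile_buffer *queue_pop(buffer_queue *q) {
    pthread_mutex_lock(&q->lock);
    while (q->head == q->tail)
        pthread_cond_wait(&q->nonempty, &q->lock);
    profile_buffer *b = q->slots[q->head];
    q->head = (q->head + 1) % (NUM_BUFFERS + 1);
    pthread_mutex_unlock(&q->lock);
    return b;
}

static buffer_queue full_q, empty_q;

/* Stage 1: take a full buffer, process it, and return it as empty. */
static void *stage1_thread(void *arg) {
    (void)arg;
    for (int r = 0; r < NUM_ROUNDS; r++) {
        profile_buffer *b = queue_pop(&full_q);
        printf("stage 1: processing buffer with %d entries\n", b->count);
        b->count = 0;
        queue_push(&empty_q, b);
    }
    return NULL;
}

int main(void) {
    queue_init(&full_q);
    queue_init(&empty_q);
    for (int i = 0; i < NUM_BUFFERS; i++)
        queue_push(&empty_q, calloc(1, sizeof(profile_buffer)));

    pthread_t t;
    pthread_create(&t, NULL, stage1_thread, NULL);

    /* Stage 0: the profiling thread grabs an empty buffer, fills it with
     * (fake) profile entries, and hands it over to stage 1. */
    for (int r = 0; r < NUM_ROUNDS; r++) {
        profile_buffer *b = queue_pop(&empty_q);
        for (int i = 0; i < BUF_ENTRIES; i++)
            b->entries[i] = (unsigned long)(r * BUF_ENTRIES + i);
        b->count = BUF_ENTRIES;
        queue_push(&full_q, b);
    }
    pthread_join(t, NULL);
    for (int i = 0; i < NUM_BUFFERS; i++)
        free(queue_pop(&empty_q));
    return 0;
}

With only two buffers in flight, the profiling thread blocks only when stage 1 has not yet finished the previous buffer; larger or more numerous shared buffers reduce how often that happens, which is why buffer size matters in the results later.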

10 Stage 0: Profiling – compact profile, minimal overhead

11 Stage 0: Profiling
Runtime Execution Profile (REP)
– fast profiling
– small profile size
– easy information extraction
Hierarchical structure
– profile buffers
– data units
– slots
Can be customized for different analyses
– in our prototype we consider cache simulation

12 REP Example
[Figure: REP structures for two basic blocks.
bb1: mov [eax + 0x0c] -> eax; mov ebp -> esp; pop ebp; return
bb2: pop ebx; pop ecx; cmp eax, 0; jz label_bb3
Static REP unit for bb1: tag 0x080483d7, num_slots: 2, num_refs: 3, refs:
ref0: pc 0x080483d7, type read, size 4, offset 12, value_slot 1, size_slot -1
ref1: pc 0x080483dc, type read, size 4, offset 0, value_slot 2, size_slot -1
ref2: pc 0x080483dd, type read, size 4, offset 4, value_slot 2, size_slot -1
The dynamic REP buffers (first buffer, next buffer) hold one data unit per executed block containing the recorded register values (eax and esp for bb1, 12 bytes; esp for bb2), plus a profile base pointer and a canary zone.]
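
To make the buffer/unit/slot hierarchy concrete, here is a hypothetical C sketch of the static and dynamic REP structures. The type and field names are reconstructed from the example above and are assumptions, not PiPA's actual definitions.

/* Hypothetical REP layout, reconstructed from the example slide
 * (not PiPA's real definitions). */
#include <stdint.h>
#include <stdio.h>

/* Static REP: one descriptor per memory reference in a basic block. */
typedef struct {
    uint32_t pc;          /* instruction address                      */
    uint8_t  type;        /* read or write                            */
    int8_t   size;        /* access size in bytes                     */
    int16_t  offset;      /* constant offset added to the base value  */
    int8_t   value_slot;  /* which recorded register gives the base   */
    int8_t   size_slot;   /* slot holding a dynamic size, -1 if none  */
} rep_ref;

typedef struct {
    uint32_t tag;         /* basic-block tag, e.g. 0x080483d7         */
    uint16_t num_slots;   /* register values recorded per execution   */
    uint16_t num_refs;    /* memory references in the block           */
    rep_ref  refs[];      /* ref0, ref1, ...                          */
} rep_unit_static;

/* Dynamic REP: what the instrumented code writes on each execution of
 * bb1.  One plausible 12-byte layout is the tag plus two slot values. */
typedef struct {
    uint32_t tag;         /* or a pointer back to the static unit      */
    uint32_t slots[2];    /* slot 1 = eax, slot 2 = esp in the example */
} rep_unit_dynamic_bb1;

/* A profile buffer is filled with dynamic units and ends in a canary
 * zone; hitting the canary triggers a switch to the next buffer. */
typedef struct rep_buffer {
    struct rep_buffer *next;      /* link to the next buffer           */
    uint8_t *fill_ptr;            /* profile base pointer              */
    uint8_t  data[1 << 20];       /* dynamic units, then canary zone   */
} rep_buffer;

int main(void) {
    printf("per-reference descriptor: %zu bytes, dynamic bb1 unit: %zu bytes\n",
           sizeof(rep_ref), sizeof(rep_unit_dynamic_bb1));
    return 0;
}

The point of the split is that everything that does not change across executions (pc, type, size, offset, slot assignment) lives only in the static part, so the instrumented code only has to dump a few register values per executed block.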

13 Profiling Optimization
Store register values in the REP
– avoid computing the memory address
Register liveness analysis
– avoid register stealing if possible
Record a single register value for multiple references
– a single stack pointer value for a sequence of push/pop
– the base address for multiple accesses to the same structure
More in the paper.
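
As a rough illustration of the "record a single register value" optimization, the sketch below contrasts naive per-reference address computation with the optimized scheme for a pop/pop sequence like the one in the example; the function and type names are invented for this sketch and do not come from PiPA.

/* Illustration only, not PiPA's instrumentation code. */
#include <stdint.h>
#include <stdio.h>

typedef struct { uint32_t eax, esp; } cpu_regs;   /* registers of interest */

/* Unoptimized profiling: compute and store one address per reference. */
static void profile_pops_naive(const cpu_regs *r, uint32_t *out) {
    out[0] = r->esp + 0;   /* address read by "pop ebx" */
    out[1] = r->esp + 4;   /* address read by "pop ecx" */
}

/* Optimized profiling: store the live stack pointer once; the static REP
 * already records that both references use that slot with offsets 0 and 4,
 * so stage 1 can rebuild both addresses later. */
static void profile_pops_optimized(const cpu_regs *r, uint32_t *slot) {
    slot[0] = r->esp;
}

int main(void) {
    cpu_regs r = { 0x00000001u, 0xbf9c4620u };
    uint32_t addrs[2], slot[1];
    profile_pops_naive(&r, addrs);
    profile_pops_optimized(&r, slot);
    printf("naive: 0x%08x 0x%08x  optimized slot: 0x%08x\n",
           addrs[0], addrs[1], slot[0]);
    return 0;
}

The inserted profiling code thus shrinks from two address computations and two stores to a single register store per executed block, which is where the profiling-time savings come from.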

14 REP Example (after optimization)
[Figure: the same REP example after the profiling optimizations, with basic blocks bb1 and bb2 and the static REP unit (tag 0x080483d7, num_slots: 2, num_refs: 3); the recorded eax value serves the reference at offset 12 (value_slot 1) and the single recorded esp value serves the references at offsets 0 and 4 (value_slot 2).]

15 Profiling Overhead
[Charts: slowdown relative to native execution for SPECint2000, SPECfp2000, and SPEC2000 on the 2-core, 4-core, and 8-core systems, comparing instrumentation without optimization against optimized instrumentation.]
Average slowdown: ~3x

16 Stage 1: Profile Recovery – fast recovery

17 Stage 1: Profile Recovery
Need to reconstruct the full memory reference information.
[Figure: the values recorded for one execution of bb1 (0x2304 for eax, 0x141a for esp; 0x1423 for bb2's esp) are combined with the static REP unit (tag 0x080483d7, refs at pc 0x080483d7, 0x080483dc, ...) to rebuild the full trace: pc 0x080483d7, read, size 4, address 0x2310; pc 0x080483dc, read, size 4, address 0x141a; ...]
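
A minimal sketch of this recovery step follows. It reuses the hypothetical REP layout sketched earlier (repeated here so the example is self-contained); the names are assumptions, not PiPA's API. The recovered address is simply the recorded slot value plus the static offset.

/* Hypothetical stage-1 profile recovery (not PiPA's actual code). */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t pc;
    char     type;        /* 'r' = read, 'w' = write (assumed encoding) */
    int8_t   size;
    int16_t  offset;
    int8_t   value_slot;  /* 1-based index into the recorded values     */
    int8_t   size_slot;   /* -1 when the size is static                 */
} rep_ref;

typedef struct {
    uint32_t tag;
    uint16_t num_slots;
    uint16_t num_refs;
    const rep_ref *refs;
} rep_unit_static;

/* Expand one dynamic unit (the register values recorded for one block
 * execution) into full memory reference records. */
static void recover_unit(const rep_unit_static *u, const uint32_t *slots) {
    for (int i = 0; i < u->num_refs; i++) {
        const rep_ref *r = &u->refs[i];
        uint32_t addr = slots[r->value_slot - 1] + (uint32_t)r->offset;
        printf("pc 0x%08x  %c  size %d  addr 0x%08x\n",
               r->pc, r->type, r->size, addr);
    }
}

int main(void) {
    /* Static REP for bb1, as on the example slide. */
    const rep_ref bb1_refs[] = {
        { 0x080483d7, 'r', 4, 12, 1, -1 },
        { 0x080483dc, 'r', 4,  0, 2, -1 },
        { 0x080483dd, 'r', 4,  4, 2, -1 },
    };
    const rep_unit_static bb1 = { 0x080483d7, 2, 3, bb1_refs };

    /* Values recorded for one execution of bb1:
     * slot 1 = eax = 0x2304, slot 2 = esp = 0x141a. */
    const uint32_t slots[] = { 0x2304, 0x141a };
    recover_unit(&bb1, slots);   /* addresses 0x2310, 0x141a, 0x141e */
    return 0;
}

In the pipeline this kind of expansion runs inside the stage-1 threads over every unit in a filled buffer, and the resulting full records are then split for the stage-2 simulators.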

18 Profile Recovery Overhead
Factor 1: buffer size
Experiments done on the 8-core system, using 8 recovery threads.
[Chart: slowdown relative to native execution for SPECint2000, SPECfp2000, and SPEC2000 with small (64KB), medium (1MB), and large (16MB) buffers.]

19 Profile Recovery Overhead
Factor 2: the number of recovery threads
Experiments done on the 8-core system, using 16MB buffers.
[Chart: slowdown relative to native execution for SPECint2000, SPECfp2000, and SPEC2000 with 0, 2, 4, 6, and 8 recovery threads.]

20 Profile Recovery Overhead
Factor 3: the number of available cores
Experiments done using 16MB buffers and 8 recovery threads.
[Chart: slowdown relative to profiling alone for SPECint2000, SPECfp2000, and SPEC2000 on 2, 4, and 8 cores.]

21 Profile Recovery Overhead
Factor 4: the impact of using REP
– experiments done on the 8-core system with 16MB buffers and 8 threads
[Chart: PiPA using REP vs. PiPA using a standard profile format.]
PiPA-standard: 20.7x, PiPA-REP: 4.5x

22 Stage 2: Cache Simulation – parallel analysis, independent simulators

23 Stage 2: Parallel Cache Simulation
How to parallelize?
– split the address trace into independent groups
Set-associative caches
– partition the cache sets and simulate them using several independent simulators
– merge the results (number of hits and misses) at the end of the simulation
Example (a sketch follows below):
– 32K cache, 32-byte line, 4-way associative => 256 sets
– 4 independent simulators, each one simulates 64 sets (round-robin distribution)
– two memory references that access different sets are independent
[Figure: a short address trace (0xbf9c4614, 0xbf9c4705, 0xbf9c4a34, 0xbf9c4a60, 0xbf9c4a5c, 0xbf9c460d, ...) being distributed to the four simulators by set index.]
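
Below is a minimal sketch of the set-partitioned dispatch described above. The cache parameters match the slide's example, but the function names and the exact round-robin mapping (set index modulo the number of simulators) are assumptions made for illustration.

/* Hypothetical set-partitioned dispatch for parallel cache simulation.
 * Cache: 32KB, 32-byte lines, 4-way associative => 256 sets.
 * With 4 simulators and round-robin distribution, simulator k owns every
 * set s with s % 4 == k, i.e. 64 sets each. */
#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE 32
#define NUM_SETS  256
#define NUM_SIMS  4

/* Per-simulator access counters; a real simulator would also keep the tags
 * of the 4 ways of each set it owns and count hits and misses. */
static unsigned long accesses[NUM_SIMS];

static void dispatch(uint32_t addr) {
    uint32_t set = (addr / LINE_SIZE) % NUM_SETS;  /* cache set index  */
    int sim = (int)(set % NUM_SIMS);               /* owning simulator */
    accesses[sim]++;                               /* stand-in for the */
                                                   /* simulate() call  */
    printf("addr 0x%08x -> set %3u -> simulator %d\n", addr, set, sim);
}

int main(void) {
    /* Addresses taken from the example trace on the slide. */
    const uint32_t trace[] = {
        0xbf9c4614, 0xbf9c4705, 0xbf9c4a34,
        0xbf9c4a60, 0xbf9c4a5c, 0xbf9c460d,
    };
    for (unsigned i = 0; i < sizeof(trace) / sizeof(trace[0]); i++)
        dispatch(trace[i]);
    for (int k = 0; k < NUM_SIMS; k++)
        printf("simulator %d handled %lu accesses\n", k, accesses[k]);
    return 0;
}

Because references that map to different sets never interact, the simulators need no synchronization while running; their hit and miss counters are simply summed when the simulation ends.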

24 Cache Simulation Overhead
Experiments done on the 8-core system
– 8 recovery threads and 8 cache simulators
[Chart: slowdown for PiPA (10.5x) vs. Pin dcache (32x).]
PiPA speedup over dcache: 3x

25 SPEC 2006 Results
Experiments done using the 8-core system.
[Chart: slowdown for profiling only (3.27x), profiling + recovery (3.7x), and full cache simulation (10.2x).]
Average speedup over dcache: 3x

26 Summary
PiPA is an effective technique for parallel profiling and analysis
– based on pipelining
– drastically reduces both profiling time and analysis time
– full cache simulation incurs only a 10.5x slowdown
Runtime Execution Profile
– requires minimal instrumentation code
– compact enough to ensure optimal buffer usage
– makes it easy for the next stages to recover the full trace
Parallel cache simulation
– the cache sets are partitioned among several independent simulators

27 Future Work
Design APIs
– hide the communication between the pipeline stages
– let tool writers focus only on the instrumentation and analysis tasks
Further improve the efficiency
– parallel profiling
– workload monitoring
More analysis algorithms
– branch prediction simulation
– memory dependence analysis
– ...

28 Pin Prototype
Second implementation, in Pin
Preliminary results: 2.6x speedup over Pin dcache
Plan to release PiPA
www.comp.nus.edu.sg/~ioana

