PiPA: Pipelined Profiling and Analysis on Multi-core Systems
Qin Zhao, Ioana Cutcutache, Weng-Fai Wong
CGO
Why PiPA?
Code profiling and analysis
–very useful for understanding program behavior
–implemented using dynamic instrumentation systems
–several challenges: coverage, accuracy, overhead
  –overhead due to the instrumentation engine
  –overhead due to the profiling code
The performance problem!
–Cachegrind: 100x slowdown
–Pin dcache: 32x slowdown
Need faster tools!
Our Goals
Improve the performance
–reduce the overall profiling and analysis overhead
–but maintain the accuracy
How?
–parallelize!
–optimize
Keep it simple
–easy to understand
–easy to build new analysis tools
Previous Approach
Parallelized slice profiling
–SuperPin, Shadow Profiling
–suitable for simple, independent tasks
[Figure: timeline comparing the original application, the instrumented application, and the SuperPinned application — instrumentation and profiling overhead are hidden by running instrumented slices in parallel with the uninstrumented application]
PiPA Key Idea: Pipelining!
–stage 0: instrumented application
–stage 1: profile processing
–stage 2: parallel analysis
Stages run as threads or processes, passing profile information down the pipeline.
[Figure: timeline — while the instrumented application runs, stage 1 processes the profile and stage 2 runs analyses on profiles 1–4 in parallel]
PiPA Challenges
Minimize the profiling overhead
–Runtime Execution Profile (REP)
Minimize the communication between stages
–double buffering
Design efficient parallel analysis algorithms
–we focus on cache simulation
PiPA Prototype: Cache Simulation
Our Prototype
Implemented in DynamoRIO
Three stages
–stage 0: instrumented application – collects the REP
–stage 1: parallel profile recovery and splitting
–stage 2: parallel cache simulation
Experiments
–SPEC2000 & SPEC2006 benchmarks
–3 systems: dual-core, quad-core, and eight-core
Communication
Keys to minimizing the overhead
–double buffering
–shared buffers
–large buffers
Example: communication between stage 0 and stage 1
[Figure: the profiling thread at stage 0 fills shared buffers that are drained by the processing threads at stage 1]
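The buffer hand-off between stage 0 and stage 1 can be sketched as follows (a minimal Python model of double buffering, not the DynamoRIO implementation; buffer size and names are illustrative — the prototype uses shared memory buffers of up to 16MB):

```python
import threading
from queue import Queue

BUF_SIZE = 4  # tiny for illustration; the prototype uses large buffers

# Two shared buffers: stage 0 fills one while stage 1 drains the other.
empty_bufs = Queue()   # buffers ready to be filled by stage 0
full_bufs = Queue()    # buffers ready to be processed by stage 1
for _ in range(2):
    empty_bufs.put([])

def profiler(events):
    """Stage 0: append profile records, handing off a buffer when it fills."""
    buf = empty_bufs.get()
    for ev in events:
        buf.append(ev)
        if len(buf) == BUF_SIZE:
            full_bufs.put(buf)          # hand the full buffer to stage 1
            buf = empty_bufs.get()      # grab the other (recycled) buffer
    full_bufs.put(buf)                  # flush the final partial buffer
    full_bufs.put(None)                 # end-of-trace marker

processed = []
def recovery():
    """Stage 1: drain full buffers and recycle them for stage 0."""
    while (buf := full_bufs.get()) is not None:
        processed.extend(buf)
        buf.clear()
        empty_bufs.put(buf)

t = threading.Thread(target=recovery)
t.start()
profiler(list(range(10)))
t.join()
```

Because the two stages block only on buffer exchange, the application thread keeps running while recovery proceeds on another core.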
Stage 0: Profiling
–compact profile
–minimal overhead
Stage 0: Profiling
Runtime Execution Profile (REP)
–fast profiling
–small profile size
–easy information extraction
Hierarchical structure
–profile buffers
–data units
–slots
Can be customized for different analyses
–in our prototype we consider cache simulation
REP Example
[Figure: REP layout for two basic blocks. bb1 (mov [eax + 0x0c] eax; mov ebp esp; pop ebp; return) is described by a static REP unit (REP-S) with tag 0x080483d7, num_slots 2, num_refs 3; each ref records pc, type, size, offset, value_slot, and size_slot — e.g. ref0: pc 0x080483d7, read, size 4, offset 12, value_slot 1, size_slot -1; ref1: pc 0x080483dc, read, size 4, offset 0, value_slot 2; ref2: pc 0x080483dd, read, size 4, offset 4, value_slot 2. The dynamic profile (REP-D) records only the register slot values (eax and esp for bb1; esp for bb2: pop ebx; pop ecx; cmp eax, 0; jz label_bb3) — 12 bytes per execution of bb1. Profile buffers are chained (first buffer → next buffer), tracked by a base pointer, and terminated by a canary zone.]
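The split between the static and dynamic parts of the REP can be modeled roughly as follows (a hypothetical Python reconstruction — field names follow the slide, but the classes and exact encoding are ours, not the prototype's):

```python
from dataclasses import dataclass

@dataclass
class Ref:
    """Static description of one memory reference (recorded once per bb)."""
    pc: int
    type: str         # 'read' or 'write'
    size: int         # access size in bytes
    offset: int       # constant displacement from the base register
    value_slot: int   # which dynamic slot holds the base register value
    size_slot: int    # -1 when the size is static

@dataclass
class RepUnit:
    """REP-S: one unit per basic block, shared by all its executions."""
    tag: int          # basic block address
    num_slots: int
    refs: list

# bb1 from the slide: 3 references described by only 2 slots (eax, esp)
bb1 = RepUnit(tag=0x080483D7, num_slots=2, refs=[
    Ref(0x080483D7, 'read', 4, 12, 1, -1),  # mov [eax + 0x0c]
    Ref(0x080483DC, 'read', 4, 0,  2, -1),  # pop ebp
    Ref(0x080483DD, 'read', 4, 4,  2, -1),  # return
])

# REP-D for one execution of bb1: only the two register values are logged,
# instead of three full (pc, address, type, size) records.
rep_d = [0x2304, 0x141A]   # slot 1 = eax, slot 2 = esp (example values)
```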
Profiling Optimizations
Store register values in the REP
–avoid computing the memory address
Register liveness analysis
–avoid register stealing if possible
Record a single register value for multiple references
–a single stack pointer value for a sequence of push/pop
–the base address for multiple accesses to the same structure
More in the paper
Profiling Overhead
[Charts: slowdown relative to native execution on SPECint2000 and SPECfp2000, for the 2-core, 4-core, and 8-core systems, with optimized instrumentation vs. instrumentation without optimization]
Avg slowdown: ~3x
Stage 1: Profile Recovery
–fast recovery
Stage 1: Profile Recovery
Need to reconstruct the full memory reference information
[Figure: the REP unit for bb1 (tag 0x080483d7, 2 slots, 3 refs) plus the recorded slot values (0x2304, 0x141a) expand into the full trace:
PC          Type  Size  Address
0x080483d7  read  4     0x2310   (= 0x2304 + offset 12)
0x080483dc  read  4     0x141a]
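A minimal sketch of the recovery step, assuming the REP fields from the earlier example (the `recover` helper and tuple layout are illustrative, not the prototype's code): each full address is simply the recorded base-register value plus the static displacement.

```python
# Static REP records for bb1: (pc, type, size, offset, value_slot)
BB1_REFS = [
    (0x080483D7, 'read', 4, 12, 1),   # mov [eax + 0x0c]
    (0x080483DC, 'read', 4, 0,  2),   # pop ebp
]

def recover(refs, slots):
    """Rebuild (pc, address, type, size) records from static refs plus the
    dynamic slot values; slots[i-1] holds the value for slot i."""
    trace = []
    for pc, rtype, size, offset, value_slot in refs:
        addr = slots[value_slot - 1] + offset   # base register + displacement
        trace.append((pc, addr, rtype, size))
    return trace

# Dynamic slots for one execution: slot 1 = eax = 0x2304, slot 2 = esp = 0x141a
trace = recover(BB1_REFS, [0x2304, 0x141A])
# trace[0] is (0x080483D7, 0x2310, 'read', 4) — eax + 12
```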
Profile Recovery Overhead
Factor 1: buffer size
–small (64KB), medium (1MB), large (16MB)
Experiments done on the 8-core system, using 8 recovery threads
[Chart: slowdown relative to native execution on SPECint2000 and SPECfp2000 for the three buffer sizes]
Profile Recovery Overhead
Factor 2: the number of recovery threads
Experiments done on the 8-core system, using 16MB buffers
[Chart: slowdown relative to native execution on SPECint2000 and SPECfp2000 with 1, 2, 4, 6, and 8 recovery threads]
Profile Recovery Overhead
Factor 3: the number of available cores
Experiments done using 16MB buffers and 8 recovery threads
[Chart: slowdown relative to profiling on SPECint2000 and SPECfp2000 with 2, 4, and 8 cores]
Profile Recovery Overhead
Factor 4: the impact of using REP
–experiments done on the 8-core system with 16MB buffers and 8 threads
–PiPA using the standard profile format: 20.7x slowdown
–PiPA using REP: 4.5x slowdown
Stage 2: Cache Simulation
–parallel analysis
–independent simulators
Stage 2: Parallel Cache Simulation
How to parallelize?
–split the address trace into independent groups
–two memory references that access different sets are independent
Set-associative caches
–partition the cache sets and simulate them using several independent simulators
–merge the results (number of hits and misses) at the end of the simulation
Example
–32K cache, 32-byte line, 4-way associative => 256 sets
–4 independent simulators, each one simulates 64 sets (round-robin distribution)
[Figure: trace references (0xbf9c4614, 0xbf9c4705, 0xbf9c4a34, 0xbf9c4a60, 0xbf9c4a5c, 0xbf9c460d, ...) are distributed by set index to the four simulators]
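The set-partitioning scheme above can be sketched as follows (a toy LRU model with illustrative names, not the paper's simulator): each worker owns every 4th set, and the per-worker hit/miss counters are summed at the end.

```python
LINE, WAYS, SETS, WORKERS = 32, 4, 256, 4   # 32KB, 4-way, 32-byte lines

class SetSim:
    """Simulates only the cache sets assigned to one worker."""
    def __init__(self):
        self.sets = {}            # set index -> list of tags, MRU first
        self.hits = self.misses = 0
    def access(self, addr):
        idx = (addr // LINE) % SETS
        tag = addr // (LINE * SETS)
        lines = self.sets.setdefault(idx, [])
        if tag in lines:
            self.hits += 1
            lines.remove(tag)
        else:
            self.misses += 1
            if len(lines) == WAYS:
                lines.pop()       # evict the LRU line (tail)
        lines.insert(0, tag)      # most recently used at head

sims = [SetSim() for _ in range(WORKERS)]

def simulate(trace):
    for addr in trace:
        idx = (addr // LINE) % SETS
        sims[idx % WORKERS].access(addr)   # round-robin set distribution
    # merge the per-worker results, as in stage 2
    return (sum(s.hits for s in sims), sum(s.misses for s in sims))

hits, misses = simulate([0xBF9C4614, 0xBF9C4705, 0xBF9C4A34, 0xBF9C4614])
```

Because a reference only ever touches one set, the workers never share state and can run on separate cores with no synchronization beyond the final merge.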
Cache Simulation Overhead
Experiments done on the 8-core system
–8 recovery threads and 8 cache simulators
PiPA: 10.5x slowdown vs. Pin dcache: 32x
PiPA speedup over dcache: 3x
SPEC2006 Results
Experiments done using the 8-core system
[Chart: slowdown for profiling only, profiling + recovery, and full cache simulation (values shown: 3.27x, 3.7x, 10.2x)]
Average speedup over dcache: 3x
Summary
PiPA is an effective technique for parallel profiling and analysis
–based on pipelining
–drastically reduces both profiling time and analysis time
–full cache simulation incurs only a 10.5x slowdown
Runtime Execution Profile
–requires minimal instrumentation code
–compact enough to ensure optimal buffer usage
–makes it easy for the next stages to recover the full trace
Parallel cache simulation
–the cache is partitioned across several independent simulators
Future Work
Design APIs
–hide the communication between the pipeline stages
–let tool writers focus only on the instrumentation and analysis tasks
Further improve the efficiency
–parallel profiling
–workload monitoring
More analysis algorithms
–branch prediction simulation
–memory dependence analysis
–...
Pin Prototype
Second implementation, in Pin
–preliminary results: 2.6x speedup over Pin dcache
We plan to release PiPA.