Profiler In software engineering, profiling ("program profiling", "software profiling") is a form of dynamic program analysis that measures, for example,

Slides:



Advertisements
Similar presentations
CPU Structure and Function
Advertisements

COMP375 Computer Architecture and Organization Senior Review.
Tuning of Loop Cache Architectures to Programs in Embedded System Design Susan Cotterell and Frank Vahid Department of Computer Science and Engineering.
SE-292 High Performance Computing Profiling and Performance R. Govindarajan
Virtual Memory 1 Computer Organization II © McQuain Virtual Memory Use main memory as a cache for secondary (disk) storage – Managed jointly.
Programming Technologies, MIPT, April 7th, 2012 Introduction to Binary Translation Technology Roman Sokolov SMWare
Memory Protection: Kernel and User Address Spaces  Background  Address binding  How memory protection is achieved.
Software & Services Group PinPlay: A Framework for Deterministic Replay and Reproducible Analysis of Parallel Programs Harish Patil, Cristiano Pereira,
Exploring P4 Trace Cache Features Ed Carpenter Marsha Robinson Jana Wooten.
Computer Structure 2014 – Out-Of-Order Execution 1 Computer Structure Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.
Intel® performance analyze tools Nikita Panov Idrisov Renat.
Pin : Building Customized Program Analysis Tools with Dynamic Instrumentation Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff.
Helper Threads via Virtual Multithreading on an experimental Itanium 2 processor platform. Perry H Wang et. Al.
Computer Architecture 2011 – Out-Of-Order Execution 1 Computer Architecture Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.
OS Fall ’ 02 Introduction Operating Systems Fall 2002.
Computer ArchitectureFall 2007 © November 21, 2007 Karem A. Sakallah Lecture 23 Virtual Memory (2) CS : Computer Architecture.
OS Spring’03 Introduction Operating Systems Spring 2003.
San Diego Supercomputer Center Performance Modeling and Characterization Lab PMaC Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation.
Computer Architecture 2010 – Out-Of-Order Execution 1 Computer Architecture Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.
Code Coverage Testing Using Hardware Performance Monitoring Support Alex Shye, Matthew Iyer, Vijay Janapa Reddi and Daniel A. Connors University of Colorado.
OS Spring’04 Introduction Operating Systems Spring 2004.
Spring 2014 SILICON VALLEY UNIVERSITY CONFIDENTIAL 1 Introduction to Embedded Systems Dr. Jerry Shiao, Silicon Valley University.
Multi-core Programming VTune Analyzer Basics. 2 Basics of VTune™ Performance Analyzer Topics What is the VTune™ Performance Analyzer? Performance tuning.
Lecture 8. Profiling - for Performance Analysis - Prof. Taeweon Suh Computer Science Education Korea University COM503 Parallel Computer Architecture &
Analyzing parallel programs with Pin Moshe Bach, Mark Charney, Robert Cohn, Elena Demikhovsky, Tevi Devor, Kim Hazelwood, Aamer Jaleel, Chi- Keung Luk,
Software Performance Analysis Using CodeAnalyst for Windows Sherry Hurwitz SW Applications Manager SRD Advanced Micro Devices Lei.
Performance Monitoring on the Intel ® Itanium ® 2 Processor CGO’04 Tutorial 3/21/04 CK. Luk Massachusetts Microprocessor Design.
Timing and Profiling ECE 454 Computer Systems Programming Topics: Measuring and Profiling Cristiana Amza.
INTRODUCTION SOFTWARE HARDWARE DIFFERENCE BETWEEN THE S/W AND H/W.
JPCM - JDC121 JPCM. Agenda JPCM - JDC122 3 Software performance is Better Performance tuning requires accurate Measurements. JPCM - JDC124 Software.
Scalable Support for Multithreaded Applications on Dynamic Binary Instrumentation Systems Kim Hazelwood Greg Lueck Robert Cohn.
Virtual Memory Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University.
Interrupt driven I/O. MIPS RISC Exception Mechanism The processor operates in The processor operates in user mode user mode kernel mode kernel mode Access.
Tool Visualizations, Metrics, and Profiled Entities Overview [Brief Version] Adam Leko HCS Research Laboratory University of Florida.
Profiling Tools Introduction to Computer System, Fall (PPI, FDU) Vtune & GProfile.
Next Generation ISA Itanium / IA-64. Operating Environments IA-32 Protected Mode/Real Mode/Virtual Mode - if supported by the OS IA-64 Instruction Set.
Software Performance Monitoring Daniele Francesco Kruse July 2010.
Full and Para Virtualization
Processor Structure and Function Chapter8:. CPU Structure  CPU must:  Fetch instructions –Read instruction from memory  Interpret instructions –Instruction.
Hardware process When the computer is powered up, it begins to execute fetch-execute cycle for the program that is stored in memory at the boot strap entry.
Interrupt driven I/O Computer Organization and Assembly Language: Module 12.
PAPI on Blue Gene L Using network performance counters to layout tasks for improved performance.
CSE 451: Operating Systems Winter 2015 Module 25 Virtual Machine Monitors Mark Zbikowski Allen Center 476 © 2013 Gribble, Lazowska,
Instruction-Based Sampling and AMD CodeAnalyst ISPASS 2010 poster session Paul J. Drongowski | March 29, 2010.
Performance profiling of Experiments’ Geant4 Simulations Geant4 Technical Forum Ryszard Jurga.
Performance Monitoring Update Daniele Francesco Kruse August 2010.
LECTURE 10 Pipelining: Advanced ILP. EXCEPTIONS An exception, or interrupt, is an event other than regular transfers of control (branches, jumps, calls,
Virtual Memory 1 Computer Organization II © McQuain Virtual Memory Use main memory as a “cache” for secondary (disk) storage – Managed jointly.
1 ROGUE Dynamic Optimization Framework Using Pin Vijay Janapa Reddi PhD. Candidate - Electrical And Computer Engineering University of Colorado at Boulder.
Introduction to Performance Tuning Chia-heng Tu PAS Lab Summer Workshop 2009 June 30,
Introduction to Operating Systems Concepts
Chapter Overview General Concepts IA-32 Processor Architecture
Two notions of performance
Memory COMPUTER ARCHITECTURE
Yu-Lun Kuo Computer Sciences and Information Engineering
Virtual Memory Use main memory as a “cache” for secondary (disk) storage Managed jointly by CPU hardware and the operating system (OS) Programs share main.
Computer Structure Multi-Threading
INTEL HYPER THREADING TECHNOLOGY
Flow Path Model of Superscalars
Pipelining: Advanced ILP
OS Virtualization.
The Microarchitecture of the Pentium 4 processor
Tools.
Tools.
Introduction to OProfile
CSE 451: Operating Systems Autumn Module 24 Virtual Machine Monitors
Virtual Memory Use main memory as a “cache” for secondary (disk) storage Managed jointly by CPU hardware and the operating system (OS) Programs share main.
CSE 451: Operating Systems Autumn Module 24 Virtual Machine Monitors
OPERATING SYSTEMS MEMORY MANAGEMENT BY DR.V.R.ELANGOVAN.
Dynamic Binary Translators and Instrumenters
Presentation transcript:

Profiler In software engineering, profiling ("program profiling", "software profiling") is a form of dynamic program analysis that measures, for example, the space (memory) or time complexity of a program, the usage of particular instructions, or the frequency and duration of function calls. (Wikipedia)software engineeringdynamic program analysiscomplexity of a program usage of particular instructions Static Analysis Tools Dynamic Analysis Tools Hybrid Analysis Tools (Source - An Overview of Software Performance Analysis Tools and Techniques: From GProf to Dtrace)

Static Analysis Tools Two methods – Source Code Instrumentation – data collection or sampling routines are added Performance slowdown due to instrumented code Can potentially modify the behaviour of program – Sampling Compile Time instrumentation Tools – Source code instrumented by compiler – Counters and Monitor Calls inserted – Generate Function Call Graphs – gprof Sampling Tools – At regular intervals record state – Accuracy and runtime depend on Sampling rate – User level monitoring code (qprof) – Kernel level monitoring code (oprofile)

Compile Time Instrumentation Tool

Sampling Tool

Static Analysis Tools Hardware Counter Tools – Use h/w event counters provided in processors (e.g. cache misses, number of floating point instructions etc.) – Perfsuite (linux based) Compound Tools – ST plus HCT – vTune (Intel), Code XL(AMD)

gprof Uses Compile Time Instrumentation and Sampling Compile source code with –pg Run the executable, will generate gmon.out gprof gmon.out > profiler.output Flat profile – time spent in each function Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls ms/call ms/call name open offtime memccpy write

Dynamic Analysis Tools Binary level modifications at runtime Two types – Binary Instrumentation Customized analysis routines inserted at arbitrary points at runtime APIs provided by the tool can be used to decide what to monitor Pin – Probing Predefined Shared library or kernel routines (Probes) to collect data DTrace

PIN pin –t pintool – application pin –t pintool –pid 1234 (attach to a process) – pin -> instrumentation engine – pintool -> instrumentation tool Instrumentation routines – where instrumentation is inserted (e.g. before instr) Analysis routines – what to do (e.g. incr counter)

Pintool example Instr pointer (PC) Effective addr of a mem write

PLDI’0510 Dynamic Instrumentation Original code Code cache Pin fetches trace starting block 1 and start instrumentation 7’ 2’ 1’ Pin Exits point back to Pin

PLDI’0511 Dynamic Instrumentation Original code Code cache Pin transfers control into code cache (block 1) ’ 2’ 1’ Pin

PLDI’0512 Dynamic Instrumentation Original code Code cache 7’ 2’ 1’ Pin Pin fetches and instrument a new trace 6’ 5’ 3’ trace linking

PLDI’0513 Pin’s Software Architecture JIT Compiler Emulation Unit Virtual Machine (VM) Code Cache Instrumentation APIs Application Operating System Hardware Pin Pintool  3 programs (Pin, Pintool, App) in same address space:  User-level only  Instrumentation APIs:  Through which Pintool communicates with Pin  JIT compiler:  Dynamically compile and instrument  Emulation unit:  Handle insts that can’t be directly executed (e.g., syscalls)  Code cache:  Store compiled code => Coordinated by VM Address space

CodeXL(AMD) Source – CodeXL User Manual Available on Windows and Linux Statistical Sampling Based Three profile modes – Time Based Profile (TBP) – to identify ‘hotspots’ – Event Based Profile (EBP) – uses HW Performance Monitor Counters (PMC). Example: processor clock cycles, retired instructions, data cache accesses, and data cache misses – Instruction Based Sampling (IBS)

CodeXL Usage Two modes: – Execute a CPU Profile Session – Attach to a process Select a CPU profile type – Time based profile – Event based profile – Instruction based sampling

Event Based Profiling ( Assess Performance ) Retired Instructions CPU clock cycles not halted Retired branch instructions Retired mispredicted branch instructions Data cache accesses Data cache misses L1 DTLB and L2 DTLB misses Misaligned accesses

Event Based Profiling ( Investigate Branching ) Retired Instructions Retired branch instructions Retired mispredicted branch instructions Retired taken branch instructions Retired near returns Retired mispredicted near returns Retired mispredicted indirect branches

Event Based Profiling ( Investigate Data Access ) Retired Instructions Data cache accesses Data cache misses Data cache refills from L2 or Northbridge L1 DTLB miss and L2 DTLB hit L1 DTLB and L2 DTLB misses Misaligned accesses

Event Based Profiling ( Investigate Instruction Access ) Retired Instructions Instruction Cache fetches Instruction Cache misses L1 ITLB miss and L2 ITLB hits L1 ITLB miss and L2 ITLB miss

Event Based Profiling ( Investigate L2 Cache Access) Retired Instructions Requests to L2 cache L2 cache misses L2 fill/writeback

Event Based Profiling (Cache Line Utilization) (CLU) measures how much of a cache line is used (read or written) before it is evicted from the cache. EventDescription Cache Line Utilization Percentage The cache line utilization percentage for all cache lines on all cores accessed by this instruction / function / module. Line Boundary CrossingsThe number of accesses to the cache line that spanned two cache lines. This happens when an unaligned access is made that causes two cache lines to be touched. Bytes/L1 EvictionThe number of bytes accessed between cache line evictions. Accesses/L1 EvictionThe number of accesses (loads plus stores) to a cache line between evictions. L1 EvictionsThe number of times a cache line was evicted where this instruction depended on the data in the cache line. AccessesThe total number of loads and stores samples for this instruction / function / module. Bytes AccessedThe total number of bytes accessed by this instruction / function / module.

Instruction Based Sampling Two main categories of pipeline stages: – Instruction Fetch – Instruction Execution IBS Fetch Sampling: – Select a fetch at the end of a sampling period (certain number of fetched instructions) – IBS fetch sample is taken when the selected fetch completes/aborts The fetch address Whether the fetch completed or aborted Whether the fetch missed in the instruction cache (IC) Whether the fetch missed in the level 1 or level 2 ITLB The page size of the address translation The fetch latency, i.e., cycles from when the fetch was initiated to when the fetch either completed or aborted

Instruction Based Sampling IBS op Sampling : Each AMD64 instruction translated into or more macro ops. Counts processor cycles/dispatchd ops and periodically selects an op to be tagged and monitored and IBS sample is generated only if the op retires – The AMD64 instruction address for the op – The tag-to-retire time (cycles from when the op was tagged to when the op retired) – The completion-to-retire time (cycles from when the op completed to when the op retired) – Whether the op implements AMD64 branch semantics (a "branch op") If the branch op was mispredicted If the branch was taken If the branch was a return If the return was mispredicted – Whether the op performed a load and/or store operation If the operation missed in the data cache If the operation missed in the level 1 or level 2 DTLB The page size of the level 1 or level 2 address translation If the operation caused a misaligned access The DC miss latency (in cycles) if the load operation missed in the data cache The virtual and physical address of the requested memory location If the access was made to local or remote memory