
The Vampir Performance Analysis Tool Hans–Christian Hoppe Gesellschaft für Parallele Anwendungen und Systeme mbH Pallas GmbH Hermülheimer Straße 10 D-50321 Brühl, Germany SCICOMP 2000 Tutorial, San Diego

© Pallas GmbH Outline Performance tools for parallel programming Performance analysis for MPI The Vampir tool The Vampir roadmap

Why performance tools? CPUs and interconnects are getting faster all the time Compilers are improving “Abundance of computing power” Shouldn’t it be sufficient to just write an application and let the system do the rest?

Why performance tools? In reality, there remain severe performance bottlenecks –slow memory access (instructions and data) –cache consistency effects –starvation of instruction units –contention of interconnection systems –adverse interaction with schedulers

Why performance tools? The application programmer does the rest –excessive sequential sections –bad load balance –non–optimized communication patterns –excessive synchronization Performance analysis tools can –help to diagnose system–level performance problems –help to identify user–level performance bottlenecks –assist the users in improving their applications

Achieved performance vs. effort [Chart: achieved code performance plotted against programming effort; labeled points include “Code doesn’t work”, MPI, OpenMP, performance tools, KAP, and debuggers]

Performance aspects Sequential performance –Optimize memory accesses –Optimize instruction sequences Parallel performance –Minimize sequential sections and replicated work –Optimize load balance and communication –Reduce synchronization Parallel correctness –Analyze results –Analyze execution traces –Compare parallel vs. sequential code

Kinds of performance tools Sequential performance –Profiling tools –Compiler– and hardware–specific Parallel performance –Static code analysis –Automatic parallelisation –Counter–based profiling tools –Event tracing tools (analysis, prediction) Parallel correctness –Static code analysis tools –Trace–based verification

Vendor–specific vs. portable tools Vendor–specific tools –Superior support for platform specifics –Proprietary data formats, APIs and user interfaces –Very useful for sequential optimizations, vendor–specific parallel models Portable tools –Concentrate on (portable) programming model –Open data formats and APIs –Useful for parallel optimizations, portable parallel models Examples –Guide (counter–based profiling) –Vampir, Dimemas, jumpshot (event trace analysis) –Assure (trace–based code verification)

Performance tools – goals? Holy grail –Automatic parallelisation and optimization –One code version for sequential and parallel –One code version for all platforms –Automatic code verification –Automatic performance verification –Automatic detection of performance problems –Integration of performance analysis and parallelisation

Performance tools – reality? Open problems –Limited capabilities of automatic parallelisation –Performance portability: portable sequential optimizations, portable parallel optimizations –Code version maintenance –Verification of MPI applications –Scaling to large, hierarchical systems

MPI performance specifics Static SPMD–model, weak synchronization No sequential sections – work is replicated or sequential communication patterns are used Data distribution defined by communication Work distribution determined by data distribution Explicit communication and synchronization Optimization areas –Load balancing (tune data distribution) –Parallelize replicated work –Tune communication patterns –Reduce synchronization

Event–based MPI Analysis Record trace of application execution –Calls to MPI and user routines –MPI communication events –Source locations –Values of performance registers or program variables From a trace, a performance analysis tool can show –Protocol of execution over time –Statistics for MPI routine execution –Statistics for communication –Dynamic calling tree Important advantage –Focus on any phase of the execution
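The recording side of this approach can be sketched in a few lines. The Python toy below is purely illustrative — the names `traced` and `trace` are invented here, not Vampirtrace's actual API — and shows how timestamped enter/leave events accumulate as instrumented routines run:

```python
import time

trace = []  # global list of (timestamp, kind, routine_name) event records

def traced(routine):
    """Wrap a routine so each call records enter/leave events."""
    def wrapper(*args, **kwargs):
        trace.append((time.perf_counter(), "enter", routine.__name__))
        try:
            return routine(*args, **kwargs)
        finally:
            trace.append((time.perf_counter(), "leave", routine.__name__))
    return wrapper

@traced
def compute():
    # stand-in for real application work
    return sum(range(1000))

result = compute()
kinds = [kind for _, kind, _ in trace]  # ["enter", "leave"]
```

A real tracing library intercepts MPI calls via the MPI profiling interface instead of a decorator, but the resulting event stream has the same shape.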

Vampirtrace details Vampirtrace ™ –Instrumentation library producing traces for Vampir and Dimemas –Supports MPI–1 (incl. collective operations) and MPI–I/O –Exploits MPI profiling interface –Works with vendors’ MPI implementations –API for user–level instrumentation –Capability to filter for event subsets Developed, productized and marketed by Pallas Available for IBM SP, PE 3.x

Vampir details Vampir ™ –Event–trace visualization tool –Analyzes MPI and user routines –Analyzes point–to–point, collective and MPI–IO operations –Focus on arbitrary execution phases –Execution and communication statistics –Filter processes, messages, and user/MPI routines Jointly developed by TU Dresden and Pallas Productized and marketed by Pallas Available for IBM RS6000, AIX 4.2/AIX 4.3

Dimemas details Dimemas –Event–based performance prediction tool –Parameterized machine model CPU performance Communication and network performance –Predicts performance on modeled platform –What–if analysis to determine the influence of parameters Jointly developed by UPC Barcelona and Pallas Productized and marketed by Pallas Available for IBM RS6000, AIX 4.2/AIX 4.3

Vampirtrace options Filter events for –Processes –Time interval or record count –Event type Instrumentation (user routines, counters) –portable: by hand –some platforms (Fujitsu, Hitachi, NEC): by compiler Limit memory use –Spill data to disk, store all events –Only store n first/last events
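The event-type filter and the “only store n last events” option can be mimicked with a bounded buffer. This is a Python sketch with invented names and toy data, not Vampirtrace's real configuration interface:

```python
from collections import deque

MAX_EVENTS = 4
# A deque with maxlen silently drops the oldest entries, which is
# one simple way to keep only the last n events in memory.
events = deque(maxlen=MAX_EVENTS)

def record(timestamp, process, event_type):
    # Event-type filter: keep only the selected MPI routines.
    if event_type in {"MPI_Send", "MPI_Recv"}:
        events.append((timestamp, process, event_type))

# Toy run: MPI_Send at even timestamps, user code at odd ones.
for t in range(10):
    record(t, 0, "MPI_Send" if t % 2 == 0 else "user_code")

# Five MPI_Send events were recorded (t = 0, 2, 4, 6, 8);
# the bounded buffer retains only the last four.
```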

Vampir main window (Vampir 2.5) Tracefile loading can be interrupted at any time Tracefile loading can be resumed Tracefile can be loaded starting at a specified time offset Tracefile can be re–written

Summary chart Aggregated profiling information –Execution time –Number of calls Inclusive or exclusive of called routines
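How inclusive and exclusive times fall out of a trace can be shown with a small worked example; the event layout below is hypothetical toy data, not Vampir's trace format:

```python
# (timestamp, kind, routine) events; mpi_send runs nested inside main.
events = [
    (0.0, "enter", "main"), (1.0, "enter", "mpi_send"),
    (3.0, "leave", "mpi_send"), (4.0, "leave", "main"),
]

inclusive, exclusive = {}, {}
stack = []  # entries: [name, enter_time, time_spent_in_children]
for t, kind, name in events:
    if kind == "enter":
        stack.append([name, t, 0.0])
    else:
        name, t0, child_time = stack.pop()
        span = t - t0
        inclusive[name] = inclusive.get(name, 0.0) + span
        # exclusive time excludes the spans of called routines
        exclusive[name] = exclusive.get(name, 0.0) + span - child_time
        if stack:
            stack[-1][2] += span  # credit this span to the caller's children

# main: inclusive 4.0, exclusive 2.0; mpi_send: inclusive 2.0, exclusive 2.0
```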

Vampir state model User specifies activities and symbol grouping Look at all/any activities or all symbols [Summary chart diagram: example grouping with activities Calculation, Tracing, MPI and symbols MPI_Send, MPI_Recv, MPI_Wait, ssor, exchange]

Timeline display To zoom, mark a region with the mouse

Timeline display – zoomed

Timeline display – contents Shows all selected processes Shows state changes (activity color) Shows messages, collective and MPI–IO operations Can show parallelism display at the bottom

Timeline display – message details Click on a message line to see message information (send and receive operations)

Communication statistics Message statistics for each process/node pair: –Byte and message count –min/max/avg message length, bandwidth
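Such per-pair statistics are straightforward to derive from the message records in a trace. A Python sketch with toy data (the record layout is assumed for illustration, not Vampir's format):

```python
from collections import defaultdict

# (sender, receiver, byte_count) message records from a toy trace
messages = [(0, 1, 100), (0, 1, 300), (1, 0, 50)]

stats = defaultdict(lambda: {"count": 0, "bytes": 0, "min": None, "max": None})
for src, dst, nbytes in messages:
    s = stats[(src, dst)]
    s["count"] += 1
    s["bytes"] += nbytes
    s["min"] = nbytes if s["min"] is None else min(s["min"], nbytes)
    s["max"] = nbytes if s["max"] is None else max(s["max"], nbytes)

pair = stats[(0, 1)]
avg_length = pair["bytes"] / pair["count"]  # average message length: 200.0
```

Bandwidth per pair follows the same pattern once each record also carries send/receive timestamps.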

Message histograms Message statistics by length, tag or communicator –Byte and message count –Min/max/avg bandwidth

Collective operations For each process: mark operation locally Connect start/stop points by lines [Diagram labels: start of op, data being sent, data being received, stop of op, connection lines]

Collective operations Click on the collective operation display to see global and local timing info

Collective operations Filter collective operations Change display style

Collective operations statistics Statistics for collective operations: –operation counts, Bytes sent/received –transmission rates Filter for a collective operation (e.g. all collective operations vs. MPI_Gather only)

MPI–I/O operations I/O transfers are shown as lines Click on an I/O line to see detailed I/O information

MPI–I/O statistics Statistics for MPI–I/O transfers by file –Operation counts –Bytes read/written, transmission rates

Activity chart Profiling information for all processes

Global calling tree Display for each symbol: –Number of calls, min/max execution time Fold/unfold or restrict to subtrees
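A calling tree like this can be reconstructed directly from the enter/leave event stream. A minimal Python sketch with toy events (the nested-dict structure is an invention for illustration):

```python
# Enter/leave events for: main calls solve twice.
events = [
    ("enter", "main"), ("enter", "solve"), ("leave", "solve"),
    ("enter", "solve"), ("leave", "solve"), ("leave", "main"),
]

tree = {}       # routine -> {"calls": n, "children": {...}}, nested per call path
stack = [tree]  # children dicts along the current call path
for kind, name in events:
    if kind == "enter":
        node = stack[-1].setdefault(name, {"calls": 0, "children": {}})
        node["calls"] += 1
        stack.append(node["children"])
    else:
        stack.pop()

# tree["main"]["children"]["solve"]["calls"] == 2
```

Min/max execution times per node follow by also pushing timestamps, as in the summary-chart example.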

Process–local displays Timeline (showing calling levels) Activity chart Calling tree (showing number of calls)

Other displays Parallelism display Pending Messages display Trace Comparison feature –compare different runs (scalability analysis) –compare different processes

Focus on a time interval Choose a time interval by zooming with the timeline display Enable the Show Timeline Portion option All statistics windows are updated for the selected interval Use to focus on one application phase or iteration!

Effects of zooming Select one iteration Updated summary Updated message statistics

Compare traces Compare profiling information –To check load balance (between processes) –To evaluate scalability (different runs) –To look at optimization effects (different code versions) Example: compare processes 6 and 19, comparison by routine

Coupling Vampir and Dimemas Actual program run vs. ideal communication

Vampir/Vampirtrace roadmap Ongoing developments –Scalability enhancements –Functionality enhancements –Instrumentation enhancements Will first be available commercially on NEC and Compaq platforms –Earth Simulator –ASCI machines PathForward developments for ASCI machines

Scalability challenges Scalability in processor count –ASCI–class machines have 1000s of processors –High–end systems have 100s of processors –Applications use most of them Scalability in time –Need to analyze actual production runs (hours/days) Scalability in detail –Record and analyze system–specific performance data –Support for threaded and hybrid models

Scalability problems Counter–based profiling tools are basically OK –Severely limited in the level of detail –Can’t focus into parts of application run Event–based tools have problems –Event traces get really large –Display tools use huge amounts of memory –Many displays do not scale Example: Vampir tracefiles for NAS NPB–LU –128 processes: records (120 MByte) –256 processes: records (600 MByte) –512 processes: records (6 GByte)

Threaded programming models Enhance Vampir to display –Thread fork/join –Thread synchronization –Show a timeline per thread / aggregate threads into single timeline –Display subroutine/code block execution for each thread Create instrumentation library for thread packages Integrate instrumentation capability into OpenMP systems

Cluster node display Cluster information is already recorded Enhance Vampir to –show aggregate execution information per node –show communication volume per node

Cluster timeline display Display node–level information Show communication volume within nodes Show communication between nodes as usual Allow expanding nodes into processes There may be more than two hierarchy levels...

Cluster timeline display

Structured tracefile format Subdivide the tracefile into frames –Time intervals, thread/process/node subsets Put frame data –All in one file (as today) –In multiple files (one per frame...) –On a parallel filesystem (exploit parallelism) Frame index file holds –Location of frame start/end –Frame statistic data for immediate display –“Frame thumbnail”
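The frame-index idea can be illustrated with a toy partitioner that bins events into fixed time intervals and keeps a per-frame summary, so a viewer could show an overview before loading any event data. Names and the frame length here are assumptions for illustration, not the proposed file format:

```python
# (timestamp, event_name) records from a toy trace
events = [(0.5, "MPI_Send"), (1.2, "MPI_Recv"), (1.8, "MPI_Send"), (3.1, "MPI_Wait")]
FRAME_LEN = 1.0  # seconds per frame (hypothetical choice)

frames = {}  # frame index: frame number -> summary statistics
for t, name in events:
    idx = int(t // FRAME_LEN)
    f = frames.setdefault(idx, {"start": idx * FRAME_LEN, "count": 0})
    f["count"] += 1  # a real index would hold richer per-frame statistics

# frames[1] summarizes the interval [1.0, 2.0): two events recorded there
```

A viewer that loads only this index can display per-frame statistics immediately and fetch full event data lazily, frame by frame.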

Structured tracefile format Vampir loads the frame index Displays immediately available –Global profiling/communication statistics –By–frame profiling/communication statistics –Thumbnail timeline User gets overview of application run –Can load particular frame data –Can navigate between frames User can refine instrumentation/tracing –Get detailed trace of interesting frames

Dynamic tracing control What can be controlled –Definition of frames –Data to be recorded per frame Control methods –Instrumentation with Vampirtrace API –Binary instrumentation (atom) or use of a debugger –Configuration file –Interactive control agent (debugger) Tracing the right data is an iterative process!

Cluster timeline display For very large systems, still can’t look at complete system (too many nodes) Display “interesting” nodes only –Regarding communication volume/delays –Regarding load imbalance –Regarding execution times of particular code modules

Scalable Vampir structure Scalable user–interface Scalable internals [Architecture diagram: a display component (user interaction, trace data analysis, display handling) exchanges data and control with a processing component (trace data processing, trace data I/O) over structured trace data; one part runs on a workstation, the other on the parallel system and may exploit a parallel FS]

Access to Pallas tools Download free evaluation copies from