KOJAK Evaluation Report
Adam Leko, Hans Sherburne
UPC Group, HCS Research Laboratory, University of Florida
Color encoding key: Blue: Information; Red: Negative note; Green: Positive note

2 Basic Information
- Name: KOJAK
- Developer: Forschungszentrum Jülich, UTK
- Current versions:
  - Stable: KOJAK-v2.0
  - Development: KOJAK v2.1b1
- Website:
- Contacts:
  - Felix Wolf
  - Bernd Mohr
  - Generic

3 KOJAK Overview
- A collection of tools for automated performance analysis
  - Instrumentation utilities: DUCTAPE, OPARI
  - Trace file format/library: EPILOG
  - High-level trace API: EARL
  - Pattern matching/performance knowledge representation: EXPERT
  - Visualization tool: CUBE
- Can also export to Vampir’s VT3 format
- Acronym soup
  - KOJAK: Kit for Objective Judgement and Knowledge-based detection of performance bottlenecks
  - DUCTAPE: C++ program Database Utilities and Conversion Tools APplication Environment
  - EPILOG: Event Processing, Investigating and LOGging
  - EARL: Event Analysis and Recognition Library
  - EXPERT: Extensible Performance Tool
  - OPARI: OpenMP Pragma And Region Instrumentor
  - CUBE: CUBE Uniform Behavioral Encoding

4 KOJAK Architecture

5 Instrumentation Overview
- Automatic instrumentation (kinst)
  - Only available on a few platforms:
    - Linux clusters, PGI compilers
    - Hitachi SR-8000
    - Solaris, Sun Fortran90 compiler
    - NEC SX
  - Based on undocumented compiler features
- Manual instrumentation
  - MPI profiling interface
    - Just need to link against the elg.mpi library
    - Only instruments MPI calls
  - EPILOG API (see sketch below)
    - Place macros at the start and end of every function:
      - ELG_USER_START("function-name");
      - ELG_USER_END("function-name");
    - Compile with -DEPILOG
- Binary instrumentation (elg_dpcl)
  - Uses IBM’s DPCL library
  - Only available on AIX
- OpenMP instrumentation (opari)
  - Accomplished via source-to-source transforms and linking against the POMP library
  - Only instruments OpenMP regions and constructs; functions and other code regions still need manual instrumentation
- Note: the website mentions instrumentation via DUCTAPE and TAU, but these had not been integrated into the available versions of KOJAK as of 3/05
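To make the manual EPILOG approach concrete, here is a minimal sketch of user instrumentation with the macros listed above. Only the ELG_USER_START/ELG_USER_END macros and the -DEPILOG flag come from KOJAK's documentation; the header name elg_user.h and the compute() body are assumptions for illustration.

```cpp
/* Minimal sketch of manual EPILOG instrumentation (compile with -DEPILOG).
 * "elg_user.h" is an assumed header name; compute() is an invented example. */
#include <cstdio>
#include "elg_user.h"   // assumed header providing the ELG_USER_* macros

static double compute(int n)
{
    ELG_USER_START("compute");          // record region entry in the trace
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += 0.5 * i;
    ELG_USER_END("compute");            // record region exit in the trace
    return sum;
}

int main()
{
    ELG_USER_START("main");
    double result = compute(1000000);
    ELG_USER_END("main");
    std::printf("result = %f\n", result);
    return 0;
}
```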

6 Instrumentation Overhead: CAMEL
- Performed manual instrumentation of CAMEL
  - Attempt to get a rough estimate of overhead
  - Instrumented all functions
- Ran CAMEL with 1/64th problem size
  - Execution was slowed down by an order of magnitude
  - Trace file size: 919 MB
  - CAMEL contains several hundred thousand function calls in a given execution
- Instrumented two functions within an inner loop
  - Execution time increased by a factor of 2.2
  - Trace file size: 153 MB
- Instrumented outside large loops
  - Execution time increased by a few percent
  - Trace file only 9.1 KB
- Clearly the naïve approach of "instrument all functions" is too expensive for KOJAK (see placement sketch below)
  - This behavior is common for any tracing approach, though
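The difference between the 153 MB and 9.1 KB runs comes down to where the macros sit relative to the hot loop. A hypothetical sketch (update_cell() and the surrounding functions are invented stand-ins for CAMEL's code):

```cpp
/* Hypothetical placement comparison; compile with -DEPILOG as above.
 * update_cell() is an invented stand-in for the per-iteration work. */
#include "elg_user.h"   // assumed header providing the ELG_USER_* macros

void update_cell(int step);   // invented helper, defined elsewhere

void per_iteration_version(int nsteps)   // ~2*nsteps trace events: expensive
{
    for (int step = 0; step < nsteps; ++step) {
        ELG_USER_START("update_cell");
        update_cell(step);
        ELG_USER_END("update_cell");
    }
}

void whole_loop_version(int nsteps)      // 2 trace events total: cheap
{
    ELG_USER_START("update_loop");
    for (int step = 0; step < nsteps; ++step)
        update_cell(step);
    ELG_USER_END("update_loop");
}
```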

7 Instrumentation Overhead: Test Suite
- Instrumentation performed using the MPI profiling interface (see sketch below)
  - Overall, instrumentation overhead very low (one of the lowest seen thus far)
  - Instrumentation with PAPI enabled (FLOPS, L1 data miss rate) has no measurable extra overhead
  - Ping-pong has the highest reproducible overhead at 10% (worst case for MPI)
- Note: benchmarks marked with * have high variability in runtimes
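The MPI profiling interface works by interposing a wrapper with the same name as each MPI routine; the wrapper records an event and then reaches the real implementation through its name-shifted PMPI_ entry point. A minimal sketch of the mechanism (record_enter/record_exit are placeholders, not KOJAK's actual internal API):

```cpp
/* Sketch of an MPI profiling-interface wrapper; the event-recording helpers
 * are invented placeholders for what a tracing library does internally. */
#include <mpi.h>

static void record_enter(const char *name) { (void)name; /* write "enter" event */ }
static void record_exit(const char *name)  { (void)name; /* write "exit" event  */ }

/* The application links against this wrapper first; the real routine is
 * reached via its standard PMPI_ alias. */
extern "C" int MPI_Barrier(MPI_Comm comm)
{
    record_enter("MPI_Barrier");
    int rc = PMPI_Barrier(comm);
    record_exit("MPI_Barrier");
    return rc;
}
```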

8 EPILOG Overview
- Binary trace file format used by KOJAK
  - Supports OpenMP, MPI, or hybrid applications
  - Fairly compact
    - NAS LU, W workload, 8 processors: 23 MB
    - Roughly on par with the size of SLOG-2 files
  - Documented
    - Complete spec available on website
    - Has an existing API (open source) for reading and writing EPILOG files
- Can also add information from hardware counters
  - PAPI supported
- Can be converted to VAMPIR format using elg2vtf
  - Requires vptmerge
  - Does not work with the updated Intel version of Cluster Tools (vptmerge not included)

9 EARL Overview
- Provides high-level access to trace events
  - Random access to trace events
  - Also provides links between related events
- API documented, spec available on website
  - Existing implementation also available (open source) for C++ and Python
- Machine model: clusters of SMPs

10 EXPERT Overview
- Performs automatic analysis of EPILOG traces
  - Main feature of the KOJAK suite
  - Matches a collection of "performance problems" (bottleneck patterns) against the trace file
    - Bottlenecks are specified using EARL
  - Users can add their own patterns using Python or C++
    - New C++ patterns have to be compiled back into EXPERT
- Detection method (illustrated below)
  - Pattern objects register for certain types of trace events
  - The event trace reader performs callbacks when requested events are encountered
  - Pattern objects receive the callback & update state information
  - If a pattern object matches its state to its performance problem, a bottleneck is reported
- Output from EXPERT is a .cube file, which can be visualized using the CUBE tool
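The callback-driven detection scheme can be illustrated with a toy sketch. All names here are invented for illustration; EXPERT's real pattern interface (EARL-based, written in Python or C++) differs in its details.

```cpp
/* Illustrative sketch only: patterns register for event types, the trace
 * reader dispatches callbacks, and each pattern updates its own state. */
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct Event {                      // simplified trace event
    std::string type;               // e.g. "enter", "exit", "barrier_exit"
    double      duration;           // time attributed to the event, in seconds
    int         location;           // process/thread that produced it
};

struct Pattern {                    // one "performance problem"
    virtual ~Pattern() = default;
    virtual std::string wants() const = 0;        // event type it registers for
    virtual void on_event(const Event &e) = 0;    // callback: update state
};

struct BarrierTime : Pattern {      // toy pattern: total time spent in barriers
    double total = 0.0;
    std::string wants() const override { return "barrier_exit"; }
    void on_event(const Event &e) override { total += e.duration; }
};

int main()
{
    BarrierTime barrier_time;
    std::map<std::string, std::vector<Pattern *>> registry;   // event type -> patterns
    registry[barrier_time.wants()].push_back(&barrier_time);  // pattern registers itself

    std::vector<Event> trace = {        // stand-in for the EPILOG trace reader
        {"barrier_exit", 0.40, 0}, {"send", 0.10, 1}, {"barrier_exit", 0.70, 1}};
    for (const Event &e : trace)        // reader performs callbacks for each event
        for (Pattern *p : registry[e.type])
            p->on_event(e);

    std::cout << "time attributed to barriers: " << barrier_time.total << " s\n";
    return 0;
}
```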

11 EXPERT Bottleneck List
- Grey boxes (leaf nodes) are bottlenecks that can currently be detected

12 EXPERT Analysis Times
- EXPERT scalability
  - Sequential tool; analysis time scales proportionally to trace file size
  - Balancing act:
    - Try to detect too many/too complex bottlenecks: analysis time becomes intractable
    - Try to totally minimize analysis time: miss useful bottlenecks
  - Current analysis speed tractable for trace files up to a few hundred MB
  - Plans to parallelize the analysis phase, but no implementation available yet

  Benchmark                   Trace file size   Analysis time (default bottleneck list)
  CAMEL                       109 KB            0.551 s
  LU                          23 MB             8m, s
  PPerf: Big message          9 KB              0.020 s
  PPerf: Diffuse procedure    383 KB            0.166 s
  PPerf: Hot procedure        7.3 KB            0.017 s
  PPerf: Intensive server     230 KB            2.795 s
  PPerf: Ping-pong            2.3 MB            5.060 s
  PPerf: Random barrier       134 KB            0.748 s
  PPerf: Small messages       8.1 MB            2m, s
  PPerf: System time          7.3 KB            0.025 s
  PPerf: Wrong way            1.2 MB            11.380 s

13 CUBE Overview
- Generic visualization tool
  - Used by KOJAK to visualize EXPERT’s analyses
  - X-Windows application (uses the wxWindows toolkit)
- Buzzword description
  - Displays multidimensional data in a scalable fashion
  - Reduces all data to a hierarchical display of 3 dimensions ("cube")
    - Data is aggregated across dimensions as needed
- Dimension space (rough model sketched below)
  - Set of metrics (M)
  - Set of call paths (C)
  - Set of locations (L)
  - Each data point (m, c, l) is mapped onto a number representing the value of metric m (also referred to as severity) while the program was executing call path c at location l
  - Browsers for each dimension are linked together
    - User views one dimension with respect to another
- Uses a documented XML format to represent data
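As a rough mental model only (not CUBE's actual XML schema or API), the data can be thought of as a map from (metric, call path, location) triples to severity values, with aggregation across any dimension; all names below are invented for illustration.

```cpp
/* Conceptual model of the CUBE dimension space; not CUBE's real representation. */
#include <iostream>
#include <map>
#include <string>
#include <tuple>

using Key = std::tuple<std::string, std::string, std::string>;  // (metric, call path, location)

int main()
{
    std::map<Key, double> severity;   // each (m, c, l) point maps to one number
    severity[{"Execution time", "main/mainloop", "node0/process3"}] = 1.72;  // seconds
    severity[{"Late Sender",    "main/MPI_Recv", "node1/process5"}] = 0.31;

    // Aggregating across one dimension (here: summing a metric over all locations)
    double total = 0.0;
    for (const auto &entry : severity)
        if (std::get<0>(entry.first) == "Execution time")
            total += entry.second;
    std::cout << "total execution time across locations: " << total << " s\n";
    return 0;
}
```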

14 CUBE Overview: Simple Description
- Uses a 3-pane approach to display information
  - Metric pane
  - Module/call tree pane
    - Right-clicking brings up the source code location
  - Location pane (system tree)
- Each item is displayed along with a color to indicate the severity of the condition
- Severity can be expressed 4 ways:
  - Absolute (time)
  - Percentage
  - Relative percentage (changes module & location pane)
  - Comparative percentage (differences between executions)
- Despite the documentation, the interface is actually quite intuitive

15 CUBE Example: CAMEL After opening the .cube file (default metric shown = absolute time taken in seconds)

16 CUBE Example: CAMEL After expanding all 3 root nodes; color shown indicates metric “severity” (amount of time)

17 CUBE Example: CAMEL Selecting “Execution” shows execution time, broken down into part of code & machine

18 CUBE Example: CAMEL Selecting mainloop adjusts the system tree to only show time spent in mainloop for each processor

19 CUBE Example: CAMEL Expanded nodes show exclusive metric (only time spent by node)

20 CUBE Example: CAMEL Collapsed nodes show inclusive metric (time spent by node and all children nodes)

21 CUBE Example: CAMEL Metric pane also shows detected bottlenecks; here, it shows “Late Sender” in MPI_Recv within main spread across all nodes

22 Bottleneck Identification Test Suite
- Testing metric: what did CUBE tell us after processing the trace file with EXPERT?
  - Excluding what can be accomplished with VAMPIR export
  - Program correctness not affected by instrumentation
- CAMEL: PASSED
  - Not many problems detected
  - "Late sender" attributed to a few places in code, due to CAMEL’s unique communication pattern
- LU: TOSS-UP
  - No "too many small messages" bottleneck pattern
  - Late sender and messages in wrong order correctly identified, though
- Big messages: PASSED
  - Showed most time being spent in MPI_Send / MPI_Recv
- Diffuse procedure: FAILED
  - Just showed lots of time being spent in barriers
- Hot procedure: FAILED
  - Time incorrectly attributed to MPI_Init

23 Bottleneck Identification Test Suite (2)
- Intensive server: PASSED
  - Late sender bottleneck detected for overloaded server
- Ping-pong: PASSED
  - Late sender bottleneck detected
  - Indicates dependence of messages on each other
- Random barrier: PASSED
  - Detected "wait at barrier" bottleneck
  - Source code correlation allowed pinpointing where the problem was in the code
- Small messages: TOSS-UP
  - Illustrated large time spent in point-to-point MPI routines
  - Bottleneck incorrectly attributed to late receiver
- System time: FAILED
  - Incorrectly attributed to MPI_Init time
- Wrong order: PASSED
  - Correctly identified messages received in wrong order

24 KOJAK General Comments
- Good things
  - Portable, automatic performance analysis
  - CUBE GUI uses a novel way to present metrics
    - Source code correlation!
    - Bottlenecks are shown according to which parts of code they occur in and which machines see them
    - Data is presented in a form that makes it easier for the user to not become overwhelmed
  - Libraries are well-separated into APIs and documented
    - We have the opportunity to re-use their existing code!
  - Automatic instrumentation is available, although only for a limited number of platforms
  - Installation relatively easy
    - Code compiled pretty cleanly
  - Can still export data into VAMPIR format for more thorough user analysis
  - Tool very stable (no crashes, only a few bugs)

25 KOJAK General Comments (2)
- Things that could use improvement
  - Only a few PAPI metrics shown in GUI (FLOPS & L1 data miss rates)
    - No PAPI metrics used for bottleneck detection!
    - Could write a new pattern in EARL, though
  - When using PAPI, trace file creation fails
    - Complains about out-of-sync files
  - Some time at the beginning of the application gets incorrectly recorded under MPI_Init
  - CUBE does not correlate with source code unless automatic/binary instrumentation is used
    - Call tree in second pane turns into a flat structure when only the MPI profiling library interface is used
  - Impossible to see specific communication patterns in CUBE
    - Exporting to VAMPIR trace format is possible, but relies on the hard-to-find tool vptmerge
  - Effectiveness of automatic analysis on a day-to-day basis still unknown
    - However, very powerful tool when combined with VAMPIR

26 KOJAK: Adding UPC & SHMEM SHMEM  Not much extra work needed Need to create a SHMEM profiling interface that writes to EPILOG  Add a few extra SHMEM-specific bottleneck patterns UPC  Could potentially be difficult If we solve the UPC instrumentation problem, then we just need to use EPILOG instead of (other trace format) Could use manual instrumentation for everything but implicit communication  Add (many?) UPC-specific bottleneck patterns In either case, if manual (or source-source) instrumentation used, not much additional code has to be written  Also, since formats defined (and existing API implementations are readily available), it should be relatively easy to export to EPILOG traces

27 Evaluation (1)
- Available metrics: 4/5
  - Supports recording execution time (broken down into call trees)
  - Supports recording communication patterns + classification of events
  - Supports a few PAPI metrics
- Cost: 5/5
  - Free!
- Documentation quality: 4/5
  - Excellent "USAGE" file describes how to use the application
  - CUBE documentation overly technical in some areas
- Extensibility: 4/5
  - Can easily add new bottleneck patterns
  - Open source, uses documented APIs
- Filtering and aggregation: 3/5
  - Simple filtering & aggregation functionality in CUBE GUI
  - Not supported at the trace-file level, though
    - Cannot restrict analysis to only certain parts of the trace
  - More complicated filtering is done based on bottleneck detection algorithms

28 Evaluation (2)
- Hardware support: 5/5
  - Many platforms supported
  - Instrumentation, measurement, and analysis:
    - 64-bit Linux (Opteron and Itanium) with GNU, PGI, or Intel compilers; IBM SP (AIX); SGI MIPS-based clusters (O2k, O3k); SGI Altix; SPARC-based clusters; AlphaServer (Tru64)
  - Instrumentation and measurement only:
    - Cray X1 and T3E; IBM BlueGene/L; NEC SX; Hitachi SR-8000
- Heterogeneity support: 0/5 (not supported)
- Installation: 4.5/5
  - Comes in source form, but very easy to compile & install (no problems)
- Interoperability: 2/5
  - CUBE viewer uses a simple XML-based format
  - Can only export to VAMPIR trace files
- Learning curve: 3.5/5
  - MPI trace library easy to use, EXPERT very easy to use
  - CUBE has a learning curve but is easy to use after some practice

29 Evaluation (3)
- Manual overhead: 3/5
  - Automatic instrumentation of MPI calls on all platforms
  - Automatic instrumentation of all functions and a handful of functions via DPCL
  - MPI and OpenMP instrumentation support
- Measurement accuracy: 5/5
  - CAMEL overhead < 1%
  - Binary instrumentation more accurate but only available on AIX
  - Very low overhead for instrumenting MPI calls only
- Multiple executions: 3/5
  - Can relate all metrics between two different runs (show percentage differences)
  - Can change code and still compare runs
- Multiple analyses & views: 3.5/5
  - CUBE can show time-based metrics broken down by node and code location
  - CUBE can also show bottleneck detection metrics broken down by node and code location
  - Can export to VAMPIR to see the trace

30 Evaluation (4)
- Performance bottleneck identification: 3.5/5
  - Bottleneck rules work pretty well (could use more, though)
  - Lack of a built-in trace viewer makes identification of some bottlenecks impossible, but trace export means KOJAK could be combined with Vampir to cover most bases
- Profiling/tracing support: 3/5
  - Only performs tracing
  - Trace file format relatively compact
  - Profiling data shown in CUBE is extracted from the trace data
- Response time: 1/5
  - Have to wait until after the program finishes executing and EXPERT is done analyzing before you get any feedback
- Software support: 3.5/5
  - Supports OpenMP, MPI
  - Can support linking against any library, but does not instrument library functions
- Source code correlation: 4/5
  - Well supported in CUBE, down to the source code line level for function definitions and function calls
- Searching: 0/5 (not supported)

31 Evaluation (5)
- System stability: 4.5/5
  - No program crashes encountered
  - A few minor bugs discovered
- Technical support: 4.5/5
  - Developers responded within 24 hours
  - Gave back much useful information
  - Willing to work with us to add UPC and SHMEM support