
1 Performance Technology for Productive, High-End Parallel Computing: the TAU Parallel Performance System
Allen D. Malony
malony@cs.uoregon.edu
http://www.cs.uoregon.edu/research/tau
Performance Research Laboratory (PRL)
Neuroinformatics Center (NIC)
Department of Computer and Information Science, University of Oregon

2 HLRS 2006 Performance Technology for Productive, High-End Parallel Computing
Outline
 Research interests and motivation
 TAU performance system
   Instrumentation
   Measurement
   Analysis tools
 Parallel profile analysis (ParaProf)
 Performance data management (PerfDMF)
 Performance data mining (PerfExplorer)
 Open Trace Format (OTF)
 Conclusions and future work

3 Research Motivation
 Tools for performance problem solving
 Empirical-based performance optimization process
 Performance technology concerns
[Diagram: cycle of performance experimentation, diagnosis (hypotheses, properties), and tuning, built on performance observation: instrumentation, measurement, analysis, visualization, experiment management, performance data storage, data mining, and model-based characterization]

4 Challenges in Performance Problem Solving
 How to make the process more effective (productive)?
   Process likely to change as parallel systems evolve
 What are the important events and performance metrics?
   Tied to application structure and computational model
   Tied to application domain and algorithms
 What are the significant issues that will affect the technology used to support the process?
 Enhance application development and optimization
   Process and tools can/must be more application-aware
   Tools have poor support for application-specific aspects
 Integrate performance technology and process

5 Performance Process, Technology, and Scale
 How does our view of this process change when we consider very large-scale parallel systems?
 Scaling complicates observation and analysis
   Performance data size: standard approaches deliver a lot of data with little value
   Measurement overhead and intrusion: tradeoff with analysis accuracy; "noise" in the system
   Analysis complexity increases
 What will enhance productive application development?
   Process and technology evolution
   Nature of application development may change

6 Role of Intelligence, Automation, and Knowledge
 Scale forces the process to become more intelligent
   Even with intelligent and application-specific tools, deciding what to analyze is difficult and intractable
 More automation and knowledge-based decision making
   Build automatic/autonomic capabilities into the tools
   Support broader experimentation methods and refinement
   Access and correlate data from several sources
   Automate performance data analysis / mining / learning
   Include predictive features and experiment refinement
   Knowledge-driven adaptation and optimization guidance
 Address scale issues through increased expertise

7 TAU Performance System
 Tuning and Analysis Utilities (14+ year project effort)
 Performance system framework for HPC systems
   Integrated, scalable, flexible, and parallel
 Targets a general complex system computation model
   Entities: nodes / contexts / threads
   Multi-level: system / software / parallelism
   Measurement and analysis abstraction
 Integrated toolkit for performance problem solving
   Instrumentation, measurement, analysis, and visualization
   Portable performance profiling and tracing facility
   Performance data management and data mining
 Partners: LLNL, ANL, Research Center Jülich, LANL

8 TAU Parallel Performance System Goals
 Portable (open source) parallel performance system
   Computer system architectures and operating systems
   Different programming languages and compilers
 Multi-level, multi-language performance instrumentation
 Flexible and configurable performance measurement
 Support for multiple parallel programming paradigms
   Multi-threading, message passing, mixed-mode, hybrid, object-oriented (generic), component-based
 Support for performance mapping
 Integration of leading performance technology
 Scalable (very large) parallel performance analysis

9 General Complex System Computation Model
[Diagram: physical view (SMP nodes with memory, connected by an interconnection network for inter-node message communication) vs. model view (nodes containing VM-space contexts and threads)]
 Node: physically distinct shared-memory machine
   Message-passing node interconnection network
 Context: distinct virtual memory space within a node
 Thread: execution threads (user/system) within a context

10 TAU Performance System Architecture

11 TAU Performance System Architecture

12 TAU Instrumentation Approach
 Support for standard program events
   Routines, classes, and templates
   Statement-level blocks
 Support for user-defined events
   Begin/end events ("user-defined timers")
   Atomic events (e.g., size of memory allocated/freed)
   Selection of event statistics
 Support definition of "semantic" entities for mapping
 Support for event groups (aggregation, selection)
 Instrumentation optimization
   Eliminate instrumentation in lightweight routines

13 TAU Instrumentation Mechanisms
 Source code
   Manual (TAU API, TAU component API)
   Automatic (robust)
     C, C++, F77/90/95 (Program Database Toolkit (PDT))
     OpenMP (directive rewriting (Opari), POMP2 spec)
 Object code
   Pre-instrumented libraries (e.g., MPI using PMPI)
   Statically linked and dynamically linked
 Executable code
   Dynamic instrumentation (pre-execution) (DynInstAPI)
   Virtual machine instrumentation (e.g., Java using JVMPI)
 TAU_COMPILER to automate the instrumentation process

14 Multi-Level Instrumentation and Mapping
[Diagram: instrumentation points across user-level abstractions, problem domain, source code (preprocessor, compiler), object code and libraries (linker), executable, OS, VM, and runtime image, all feeding performance data]
 Multiple interfaces
 Information sharing between interfaces
 Event selection within/between levels
 Mapping: associate performance data with high-level semantic abstractions

15 Program Database Toolkit (PDT)
[Diagram: application/library sources pass through C/C++ and Fortran (F77/90/95) parsers; C/C++ and Fortran IL analyzers produce program database files accessed via the DUCTAPE API by tools such as PDBhtml (program documentation), SILOON (application component glue), CHASM (C++/F90/95 interoperability), and tau_instrumentor (automatic source instrumentation)]

16 Program Database Toolkit (PDT)
 Program code analysis framework
   Develop source-based tools
 High-level interface to source code information
 Integrated toolkit for source code parsing, database creation, and database query
   Commercial-grade front-end parsers
   Portable IL analyzer, database format, and access API
   Open software approach for tool development
   Multiple source languages
 Implement automatic performance instrumentation tools
   tau_instrumentor

17 TAU Measurement Approach
 Portable and scalable parallel profiling solution
   Multiple profiling types and options
   Event selection and control (enabling/disabling, throttling)
   Online profile access and sampling
   Online performance profile overhead compensation
 Portable and scalable parallel tracing solution
   Trace translation to EPILOG, VTF3, and OTF
   Trace streams (OTF) and hierarchical trace merging
 Robust timing and hardware performance support
   Multiple counters (hardware, user-defined, system)
 Performance measurement for CCA component software

18 TAU Measurement Mechanisms
 Parallel profiling
   Function-level, block-level, statement-level
   Supports user-defined events and mapping events
   TAU parallel profile stored (dumped) during execution
   Support for flat, callgraph/callpath, and phase profiling
   Support for memory profiling
 Tracing
   All profile-level events
   Inter-process communication events
   Inclusion of multiple counter data in traced events

19 Types of Parallel Performance Profiling
 Flat profiles
   Metric (e.g., time) spent in an event (callgraph nodes)
   Exclusive/inclusive, # of calls, child calls
 Callpath profiles (calldepth profiles)
   Time spent along a calling path (edges in callgraph)
   "main => f1 => f2 => MPI_Send" (event name)
   TAU_CALLPATH_LENGTH environment variable
 Phase profiles
   Flat profiles under a phase (nested phases are allowed)
   Default "main" phase
   Supports static or dynamic (per-iteration) phases
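To make the exclusive/inclusive distinction concrete, here is a small sketch (not TAU code; the event names, path format, and times are invented for illustration) of how exclusive time can be derived from inclusive times along callpaths such as "main => f1 => f2":

```python
# Sketch: deriving exclusive time from inclusive times in a callpath profile.
# Times (in seconds) and the path format are invented; this is not TAU's format.
inclusive = {
    "main": 10.0,
    "main => f1": 6.0,
    "main => f1 => f2": 2.0,
    "main => f1 => f2 => MPI_Send": 1.5,
}

def exclusive_times(inclusive):
    """Exclusive time of a path = its inclusive time minus the
    inclusive time of its direct children (one level deeper)."""
    exclusive = dict(inclusive)
    for path, t in inclusive.items():
        parent = path.rsplit(" => ", 1)[0] if " => " in path else None
        if parent is not None:
            exclusive[parent] -= t
    return exclusive

excl = exclusive_times(inclusive)
print(excl["main"])        # 4.0: 10.0 inclusive minus 6.0 spent in f1
print(excl["main => f1"])  # 4.0: 6.0 inclusive minus 2.0 spent in f2
```

A flat profile would aggregate these exclusive times per event name regardless of calling path; the callpath profile keeps each path separate.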

20 Performance Analysis and Visualization
 Analysis of parallel profile and trace measurement
 Parallel profile analysis
   ParaProf: parallel profile analysis and presentation
   ParaVis: parallel performance visualization package
   Profile generation from trace data (tau2pprof)
 Performance data management framework (PerfDMF)
 Parallel trace analysis
   Translation to VTF (v3.0), EPILOG, and OTF formats
   Integration with VNG (Technical University of Dresden)
 Online parallel analysis and visualization
 Integration with CUBE browser (KOJAK, UTK, FZJ)

21 ParaProf Parallel Performance Profile Analysis
[Diagram: raw files from HPMToolkit, MpiP, and TAU loaded into a PerfDMF-managed database, with metadata organized by application, experiment, and trial]

22 Example Applications
 sPPM: ASCI benchmark; Fortran, C, MPI, OpenMP or pthreads
 Miranda: research hydrodynamics code; Fortran, MPI
 GYRO: tokamak turbulence simulation; Fortran, MPI
 FLASH: physics simulation; Fortran, MPI
 WRF: weather research and forecasting; Fortran, MPI
 S3D: 3D combustion; Fortran, MPI

23 ParaProf – Flat Profile (Miranda, BG/L)
[Screenshot: flat profile on 8K processors, organized by node, context, thread; Miranda is an LLNL hydrodynamics code (Fortran + MPI), run to 64K processors]

24 ParaProf – Stacked View (Miranda)

25 ParaProf – Callpath Profile (Flash)
[Screenshot: Flash, an Argonne thermonuclear flash simulation (Fortran + MPI)]

26 ParaProf – Histogram View (Miranda)
[Screenshots: 8K and 16K processor runs]

27 NAS BT – Flat Profile
 How is MPI_Wait() distributed relative to solver direction?
 Application routine names reflect phase semantics

28 NAS BT – Phase Profile (Main and X, Y, Z)
 Main phase shows nested phases and immediate events

29 ParaProf – 3D Full Profile (Miranda)
[Screenshot: 16K processors]

30 ParaProf – 3D Full Profile (Flash)
[Screenshot: 128 processors]

31 ParaProf Bar Plot (zoom in/out with +/-)

32 ParaProf – 3D Scatterplot (Miranda)
 Each point is a "thread" of execution
 A total of four metrics shown in relation
 ParaVis 3D profile visualization library (JOGL)

33 ParaProf – Callgraph Zoom (Flash)
[Screenshot: zoom in (+) and zoom out (-)]

34 Performance Tracing on Miranda
 Use TAU to generate VTF3 traces for Vampir analysis
   MPI calls with HW counter information (not shown)
   Detailed code behavior to focus optimization efforts

35 S3D on Lemieux (TAU-to-VTF3, Vampir)
[Screenshot: S3D, a PSC 3D combustion code (Fortran + MPI)]

36 S3D on Lemieux (Zoomed)

37 Runtime MPI Shared Library Instrumentation
 We can now interpose the MPI wrapper library for applications that have already been compiled (no re-compilation or re-linking necessary!)
 Uses LD_PRELOAD for Linux
   Soon on AIX using MPI_EUILIB / MPI_EUILIBPATH
 Simply compile TAU with MPI support and prefix your MPI program with tau_load.sh
   % mpirun -np 4 tau_load.sh a.out
 Requires a shared-library MPI

38 Workload Characterization
 Idea: partition performance data for individual functions based on runtime parameters
 Enable by configuring with -PROFILEPARAM
 TAU_PROFILE_PARAM1L(value, "name")
 Simple example:

   void foo(int input) {
     TAU_PROFILE("foo", "", TAU_DEFAULT);
     TAU_PROFILE_PARAM1L(input, "input");
     ...
   }

39 Workload Characterization
 5 seconds spent in function "foo" becomes
   2 seconds for "foo [ input = <value1> ]"
   1 second for "foo [ input = <value2> ]"
   …
 Currently used in the MPI wrapper library
   Allows partitioning of time spent in MPI routines based on parameters (message size, message tag, destination node)
 Can be extrapolated to infer specifics about the MPI subsystem and the system as a whole

40 Workload Characterization
 Simple example: send/receive power-of-two message sizes (up to 32 MB)

   #include <mpi.h>
   int main(int argc, char **argv) {
       int rank, size, i, j;
       static int buffer[16*1024*1024]; /* static: too large for the stack */
       MPI_Init(&argc, &argv);
       MPI_Comm_size(MPI_COMM_WORLD, &size);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
       for (i = 0; i < 1000; i++) {
           for (j = 1; j < 16*1024*1024; j *= 2) {
               if (rank == 0) {
                   MPI_Send(buffer, j, MPI_INT, 1, 42, MPI_COMM_WORLD);
               } else {
                   MPI_Status status;
                   MPI_Recv(buffer, j, MPI_INT, 0, 42, MPI_COMM_WORLD, &status);
               }
           }
       }
       MPI_Finalize();
       return 0;
   }

41 Workload Characterization
 Two different message sizes (~3.3 MB and ~4 KB)

42 Hypothetical Mapping Example
 Particles distributed on surfaces of a cube

   Particle* P[MAX]; /* Array of particles */
   int GenerateParticles() {
     /* distribute particles over all faces of the cube */
     for (int face = 0, last = 0; face < 6; face++) {
       /* particles on this face */
       int particles_on_this_face = num(face);
       for (int i = last; i < last + particles_on_this_face; i++) {
         /* particle properties are a function of face */
         P[i] = ... f(face); ...
       }
       last += particles_on_this_face;
     }
   }

43 Hypothetical Mapping Example (continued)
 How much time (flops) is spent processing face i particles?
 What is the distribution of performance among faces?
 How is this determined if execution is parallel?

   int ProcessParticle(Particle *p) {
     /* perform some computation on p */
   }
   int main() {
     GenerateParticles(); /* create a list of particles */
     for (int i = 0; i < N; i++) /* iterate over the list */
       ProcessParticle(P[i]);
   }

44 No Performance Mapping versus Mapping
 Typical performance tools report performance with respect to routines
   Does not provide support for mapping
 TAU's performance mapping can observe performance with respect to the scientist's programming and problem abstractions
[Screenshots: TAU without mapping vs. TAU with mapping]

45 Component-Based Scientific Applications
 How to support a performance analysis and tuning process consistent with the application development methodology?
 Common Component Architecture (CCA) applications
 Performance tools should integrate with software
 Design a performance observation component
   Measurement port and measurement interfaces
 Build support for application component instrumentation
   Interpose a proxy component for each port
   Inside the proxy, track caller/callee invocations and timings
 Automate the process of proxy component creation
   Using PDT for static analysis of components
   Include support for selective instrumentation

46 Flame Reaction-Diffusion (Sandia)
[Screenshot: CCAFFEINE framework]

47 Earth System Modeling Framework
 Coupled modeling with a modular software framework
 Instrumentation for the ESMF framework and applications
   PDT automatic instrumentation
     Fortran 95 code modules
     C / C++ code modules
   MPI wrapper library for MPI calls
 ESMF component instrumentation (using CCA)
   CCA measurement port manual instrumentation
   Proxy generation using PDT and runtime interposition
 Significant callpath profiling used by the ESMF team

48 Using TAU Component in ESMF/CCA

49 Important Questions for Application Developers
 How does performance vary with different compilers?
 Is poor performance correlated with certain OS features?
 Has a recent change caused unanticipated performance?
 How does performance vary with MPI variants?
 Why is one application version faster than another?
 What is the reason for the observed scaling behavior?
 Did two runs exhibit similar performance?
 How are performance data related to application events?
 Which machines will run my code the fastest and why?
 Which benchmarks predict my code performance best?

50 Performance Problem Solving Goals
 Answer questions at multiple levels of interest
 Data from low-level measurements and simulations
   Use to predict application performance
 High-level performance data spanning dimensions
   Machine, applications, code revisions, data sets
   Examine broad performance trends
 Discover general correlations between application performance and features of the external environment
 Develop methods to predict application performance from lower-level metrics
 Discover performance correlations between a small set of benchmarks and a collection of applications that represent a typical workload for a given system

51 Automatic Performance Analysis Tool (Concept)
[Diagram: build application → execute application → build information and environment/performance data stored in a performance database → offline analysis → simple analysis feedback ("105% faster!", "72% faster!")]

52 Performance Data Management (PerfDMF)
 K. Huck, A. Malony, R. Bell, and A. Morris, "Design and Implementation of a Parallel Performance Data Management Framework," ICPP 2005 (awarded best paper).

53 Performance Data Mining (Objectives)
 Conduct parallel performance analysis in a systematic, collaborative, and reusable manner
   Manage performance complexity
   Discover performance relationships and properties
   Automate the process
 Multi-experiment performance analysis
 Large-scale performance data reduction
   Summarize characteristics of large processor runs
 Implement an extensible analysis framework
   Abstraction / automation of data mining operations
   Interface to existing analysis and data mining tools

54 Performance Data Mining (PerfExplorer)
 Performance knowledge discovery framework
   Data mining analysis applied to parallel performance data: comparative, clustering, correlation, dimension reduction, …
   Uses the existing TAU infrastructure: TAU performance profiles, PerfDMF
   Client-server based system architecture
 Technology integration
   Java API and toolkit for portability
   PerfDMF
   R-project/Omegahat, Octave/Matlab statistical analysis
   WEKA data mining package
   JFreeChart for visualization, vector output (EPS, SVG)

55 Performance Data Mining (PerfExplorer)
 K. Huck and A. Malony, "PerfExplorer: A Performance Data Mining Framework For Large-Scale Parallel Computing," SC 2005.

56 PerfExplorer Analysis Methods
 Data summaries, distributions, scatterplots
 Clustering
   k-means
   Hierarchical
 Correlation analysis
 Dimension reduction
   PCA
   Random linear projection
   Thresholds
 Comparative analysis
 Data management views

57 Cluster Analysis
 Performance data represented as vectors; each dimension is the cumulative time for an event
 k-means: k random centers are selected, and instances are grouped with the "closest" (Euclidean) center
   New centers are calculated and the process repeated until stabilization or a maximum number of iterations
 Dimension reduction is necessary for meaningful results
 Virtual topology and summaries are constructed
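The k-means procedure described above can be sketched in a few lines (plain Python with invented per-thread data; PerfExplorer itself uses WEKA/R rather than this code). Each vector holds cumulative event times for one thread of execution:

```python
def kmeans(vectors, k, max_iter=100):
    """Assign each vector to the closest (Euclidean) center, recompute
    centers as cluster means, and repeat until stable or max_iter."""
    # Sketch: first k vectors as initial centers for reproducibility;
    # the method described on the slide picks k random centers.
    centers = list(vectors[:k])
    assign = None
    for _ in range(max_iter):
        new_assign = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(v, centers[c])))
            for v in vectors
        ]
        if new_assign == assign:  # stabilized
            break
        assign = new_assign
        for c in range(k):  # recompute each center as the mean of its members
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assign, centers

# Invented per-thread cumulative times for two events (e.g. compute, MPI_Alltoall):
# three compute-heavy threads and two communication-heavy threads
data = [(9.0, 1.0), (8.5, 1.5), (9.2, 0.8), (2.0, 8.0), (1.5, 8.5)]
labels, centers = kmeans(data, k=2)
```

On real profiles each vector would have one dimension per significant event, which is why the slide notes that dimension reduction is needed first.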

58 sPPM Cluster Analysis

59 Hierarchical and K-means Clustering (sPPM)

60 Miranda Clusters, Average Values (16K CPUs)
 Two primary clusters due to MPI_Alltoall behavior …
 … also an inverse relationship between MPI_Barrier and MPI_Group_translate_ranks

61 Miranda Modified
 After code modifications, work distribution is even
 MPI_Barrier and MPI_Group_translate_ranks are no longer significant contributors to run time

62 Flash Clustering on 16K BG/L Processors
 Four significant events automatically selected
 Clusters and correlations are visible

63 Correlation Analysis
 Describes the strength and direction of a linear relationship between two variables (events) in the data
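As a concrete (hypothetical) illustration, the Pearson correlation coefficient between two events' per-thread times can be computed directly; the event names and values below are invented, echoing the Miranda observation that MPI_Barrier and MPI_Group_translate_ranks times moved inversely:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation: magnitude gives the strength and the sign
    gives the direction of the linear relationship between two series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented per-thread times (seconds): as barrier time rises,
# group-translate time falls, so r should be close to -1
barrier   = [1.0, 2.0, 3.0, 4.0, 5.0]
translate = [5.1, 3.9, 3.0, 2.1, 0.9]
r = pearson(barrier, translate)  # strongly negative (near -1)
```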

64 Comparative Analysis
 Relative speedup, efficiency
   Total runtime, by event, one event, by phase
 Breakdown of total runtime
 Group fraction of total runtime
 Correlating events to total runtime
 Timesteps per second
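For the hypothetical total runtimes below (invented numbers, not measured data), relative speedup and parallel efficiency against the smallest run are computed in the usual way:

```python
# Sketch of relative speedup/efficiency as used in comparative analysis.
# Total runtimes (seconds) per processor count are invented for illustration.
runs = {16: 100.0, 32: 52.0, 64: 28.0, 128: 16.0}

base_procs = min(runs)        # baseline: the smallest run (16 processors)
base_time = runs[base_procs]

# relative speedup: baseline time divided by each run's time
speedup = {p: base_time / t for p, t in runs.items()}

# efficiency: speedup scaled by the increase in processor count
efficiency = {p: speedup[p] * base_procs / p for p in runs}

print(round(speedup[128], 2))     # 6.25x over the 16-processor run
print(round(efficiency[128], 3))  # 0.781 parallel efficiency
```

The same computation can be applied per event or per phase, as the slide lists, by using that event's or phase's time instead of the total runtime.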

65 User-Defined Views
 Reorganization of data for multiple parametric studies
 Construction of views / sub-views with simple operators
 Simple "wizard"-like interface for creating views
[Diagram: example view hierarchies, e.g. application → processors → problem size, or application → problem type → processors]

66 PerfExplorer Future Work
 Extensions to the PerfExplorer framework
   Examine properties of performance data
   Automated guidance of analysis
   Workflow scripting for repeatable analysis
   Dependency modeling (going beyond correlation)
   Time-series analysis of phase-based data

67 Open Trace Format (OTF)
 Features
   Hierarchical trace format
   Replacement for proprietary formats such as STF (Pallas and Intel)
   Efficient stream-based parallel access
   Tracing library available on the IBM BG/L platform
   Development of OTF supported by LLNL
 Joint development effort
   ZiH / Technical University of Dresden
   ParaTools, Inc.
   http://www.paratools.com/otf

68 Open Trace Format (OTF)

69 Vampir and VNG
 Commercial trace-based tools
   Developed at ZiH, T.U. Dresden (Wolfgang Nagel, Holger Brunst, and others)
   http://www.vampir-ng.de
 Vampir Trace Visualizer
   Formerly also known as Intel® Trace Analyzer v4.0
   Based on sequential trace analysis
 Vampir Next Generation (VNG)
   Client (vng) runs on a desktop; server (vngd) on a cluster
   Parallel trace analysis
   Orders of magnitude bigger traces (more memory)
   State of the art in parallel trace visualization

70 Vampir Next Generation (VNG) Architecture
[Diagram: classic analysis is monolithic and sequential; VNG splits work between a visualization client (timeline with 16 visible traces, thumbnail, segment indicator for 768 processes) and a parallel analysis server (master plus workers 1..m) reading merged traces (trace 1..N) from the file system, with event streams arriving over the network from the monitored parallel program; process timelines show parallel I/O and message passing]

71 VNG Timeline Display (Miranda on BG/L)

72 VNG Timeline Zoomed In

73 VNG Grouping of Interprocess Communications

74 VNG Process Timeline with PAPI Counters

75 VNG Calltree Display

76 OTF/VNG Support for Counters

77 TAU Tracing Enhancements
 Configure TAU with the -TRACE -vtf=<dir> or -otf=<dir> options
   % configure -TRACE -vtf=<dir> …
   % configure -TRACE -otf=<dir> …
   Generates the tau_merge, tau2vtf, and tau2otf tools in <taudir>/<arch>/bin
 Instrument and execute the application
   % tau_f90.sh app.f90 -o app
   % mpirun -np 4 app
 Merge and convert trace files to VTF3/OTF format
   % tau_treemerge.pl
   % tau2vtf tau.trc tau.edf app.vpt.gz
   % vampir app.vpt.gz
   OR
   % tau2otf tau.trc tau.edf app.otf -n <streams>
   % vampir app.otf
   OR use VNG to analyze OTF/VTF trace files

78 VNG Communication Matrix Display

79 VNG Process Activity Chart

80 TAU Eclipse Integration
 Eclipse GUI integration of existing TAU tools
 New Eclipse plug-in for code instrumentation
   Integration with the CDT and FDT
 Java, C/C++, and Fortran projects can be instrumented and run from within Eclipse
 Each project can be given multiple build configurations corresponding to available TAU makefiles
   All TAU configuration options are available
 The ParaProf tool can be launched automatically

81 TAU Eclipse Integration
[Screenshots: TAU configuration and TAU experimentation]

82 TAU Eclipse Future Work
 Development of the TAU Eclipse plug-ins for Java and the CDT/FDT is ongoing
 Planned features include:
   Full integration with the Eclipse Parallel Tools project
   Database storage of project performance data
   Refinement of the plug-in settings interface to allow easier selection of TAU runtime and compile-time options
   Accessibility of TAU configuration and command-line tools via the Eclipse UI

83 ZeptoOS and TAU
 DOE OS/RTS for Extreme Scale Scientific Computation
   ZeptoOS: scalable components for petascale architectures
   Argonne National Laboratory and University of Oregon
 University of Oregon
   Kernel-level performance monitoring
   OS component performance assessment and tuning
   KTAU (Kernel Tuning and Analysis Utilities)
     Integration of TAU infrastructure in the Linux kernel
     Integration with ZeptoOS
     Installation on BG/L
   Port to 32-bit and 64-bit Linux platforms

84 Linux Kernel Profiling using TAU – Goals
 Fine-grained kernel-level performance measurement
   Parallel applications
 Support both profiling and tracing
 Both process-centric and system-wide views
 Merge user-space performance with kernel-space
   User-space: (TAU) profile/trace
   Kernel-space: (KTAU) profile/trace
 Detailed program-OS interaction data
   Including interrupts (IRQ)
 Analysis and visualization compatible with TAU

85 TAU Performance System Status
 Computing platforms
   IBM, SGI, Cray, HP, Sun, Hitachi, NEC, Linux clusters, Apple, Windows, …
 Programming languages
   C, C++, Fortran 90/95, UPC, HPF, Java, OpenMP, Python
 Thread libraries
   pthreads, SGI sproc, Java, Windows, OpenMP
 Communication libraries
   MPI-1/2, PVM, shmem, …
 Compilers
   IBM, Intel, PGI, GNU, Fujitsu, Sun, NAG, Microsoft, SGI, Cray, HP, NEC, Absoft, Lahey, PathScale, Open64

86 Papers at European Conferences 2006
 L. Li and A. Malony, "Model-based Performance Diagnosis for Master-Worker Parallel Computations," EuroPar 2006.
 A. Nataraj, A. Malony, A. Morris, and S. Shende, "Early Experiences with KTAU on the IBM BG/L," EuroPar 2006.
 L. Li and A. Malony, "Model-based Performance Diagnosis of Wavefront Parallel Computations," HPCC 2006.
 W. Spear, A. Malony, A. Morris, and S. Shende, "Integrating TAU with Eclipse: A Performance Analysis System in an Integrated Development Environment," HPCC 2006.
 K. Huck, A. Malony, S. Shende, and A. Morris, "TAUg: Runtime Global Performance Data Access using MPI," EuroPVM-MPI 2006.
 C. Hoge, A. Malony, and D. Keith, "Client-side Task Support in Matlab for Concurrent Distributed Execution," DAPSYS 2006.
 A. Nataraj, A. Malony, S. Shende, and A. Morris, "Kernel-level Measurement for Integrated Performance Views: the KTAU Project," Cluster 2006 (distinguished paper).

87 Project Affiliations (selected)
 Lawrence Livermore National Lab
   Hydrodynamics (Miranda), radiation diffusion (KULL)
   Open Trace Format (OTF) implementation on BG/L
 Argonne National Lab
   ZeptoOS project and KTAU
   Astrophysical thermonuclear flashes (Flash)
 Center for Simulation of Accidental Fires and Explosions
   University of Utah, ASCI ASAP Center, C-SAFE
   Uintah Computational Framework (UCF)
 Oak Ridge National Lab
   Contribution to the Joule Report (S3D, AORSA3D)

88 Project Affiliations (continued)
 Sandia National Lab
   Simulation of turbulent reactive flows (S3D)
   Combustion code (CFRFS)
 Los Alamos National Lab
   Monte Carlo transport (MCNP)
   SAIC's Adaptive Grid Eulerian (SAGE)
 CCSM / ESMF / WRF climate/earth/weather simulation
   NSF, NOAA, DOE, NASA, …
 Common Component Architecture (CCA) integration
   Performance Evaluation Research Center (PERC)
   DOE SciDAC center

89 Support Acknowledgements
 Department of Energy (DOE)
   Office of Science
   MICS, Argonne National Lab
   ASC/NNSA
     University of Utah ASC/NNSA Level 1
     ASC/NNSA, Lawrence Livermore National Lab
 Department of Defense (DoD)
   HPC Modernization Office (HPCMO)
   Programming Environment and Training (PET)
 NSF Software and Tools for High-End Computing
 Research Centre Juelich
 Los Alamos National Laboratory
 ParaTools

90 Acknowledgements
 Dr. Sameer Shende, Senior Scientist
 Alan Morris, Senior Software Engineer
 Wyatt Spear, PRL staff
 Scott Biersdorff, PRL staff
 Kevin Huck, Ph.D. student
 Aroon Nataraj, Ph.D. student
 Kai Li, Ph.D. student
 Li Li, Ph.D. student
 Adnan Salman, Ph.D. student
 Suravee Suthikulpanit, M.S. student

