The TAU Performance System: Advances in Performance Mapping Sameer Shende University of Oregon.

Slides:



Advertisements
Similar presentations
Machine Learning-based Autotuning with TAU and Active Harmony Nicholas Chaimov University of Oregon Paradyn Week 2013 April 29, 2013.
Advertisements

K T A U Kernel Tuning and Analysis Utilities Department of Computer and Information Science Performance Research Laboratory University of Oregon.
Dynamic performance measurement control Dynamic event grouping Multiple configurable counters Selective instrumentation Application-Level Performance Access.
The Path to Multi-core Tools Paul Petersen. Multi-coreToolsThePathTo 2 Outline Motivation Where are we now What is easy to do next What is missing.
Robert Bell, Allen D. Malony, Sameer Shende Department of Computer and Information Science Computational Science.
Sameer Shende Department of Computer and Information Science Neuro Informatics Center University of Oregon Tool Interoperability.
Grouping Performance Data in TAU  Profile Groups  A group of related routines forms a profile group  Statically defined  TAU_DEFAULT, TAU_USER[1-5],
Allen D. Malony, Sameer Shende Department of Computer and Information Science Computational Science Institute University.
Allen D. Malony Department of Computer and Information Science Computational Science Institute University of Oregon Integrating Performance.
TAU Parallel Performance System DOD UGC 2004 Tutorial Allen D. Malony, Sameer Shende, Robert Bell Univesity of Oregon.
The TAU Performance Technology for Complex Parallel Systems (Performance Analysis Bring Your Own Code Workshop, NRL Washington D.C.) Sameer Shende, Allen.
Nick Trebon, Alan Morris, Jaideep Ray, Sameer Shende, Allen Malony {ntrebon, amorris, Department of.
On the Integration and Use of OpenMP Performance Tools in the SPEC OMP2001 Benchmarks Bernd Mohr 1, Allen D. Malony 2, Rudi Eigenmann 3 1 Forschungszentrum.
Allen D. Malony, Sameer Shende Department of Computer and Information Science Computational Science Institute University.
Allen D. Malony Department of Computer and Information Science Computational Science Institute University of Oregon TAU Performance.
June 2, 2003ICCS Performance Instrumentation and Measurement for Terascale Systems Jack Dongarra, Shirley Moore, Philip Mucci University of Tennessee.
GHS: A Performance Prediction and Task Scheduling System for Grid Computing Xian-He Sun Department of Computer Science Illinois Institute of Technology.
Allen D. Malony, Sameer Shende Department of Computer and Information Science Computational Science Institute University.
Allen D. Malony, Sameer Shende, Robert Bell Department of Computer and Information Science Computational Science Institute, NeuroInformatics.
Kai Li, Allen D. Malony, Robert Bell, Sameer Shende Department of Computer and Information Science Computational.
Sameer Shende, Allen D. Malony Computer & Information Science Department Computational Science Institute University of Oregon.
Mapping Techniques for Load Balancing
Asst.Prof.Dr.Ahmet Ünveren SPRING Computer Engineering Department Asst.Prof.Dr.Ahmet Ünveren SPRING Computer Engineering Department.
Performance Technology for Complex Parallel Systems REFERENCES.
SC’01 Tutorial Nov. 7, 2001 TAU Performance System Framework  Tuning and Analysis Utilities  Performance system framework for scalable parallel and distributed.
Performance Technology for Complex Parallel Systems Part 2 – Complexity Scenarios Sameer Shende.
German National Research Center for Information Technology Research Institute for Computer Architecture and Software Technology German National Research.
Paradyn Week – April 14, 2004 – Madison, WI DPOMP: A DPCL Based Infrastructure for Performance Monitoring of OpenMP Applications Bernd Mohr Forschungszentrum.
Some Thoughts on HPC in Natural Language Engineering Steven Bird University of Melbourne & University of Pennsylvania.
ICOM 5995: Performance Instrumentation and Visualization for High Performance Computer Systems Lecture 7 October 16, 2002 Nayda G. Santiago.
Service-enabling Legacy Applications for the GENIE Project Sofia Panagiotidi, Jeremy Cohen, John Darlington, Marko Krznarić and Eleftheria Katsiri.
A Metadata Based Approach For Supporting Subsetting Queries Over Parallel HDF5 Datasets Vignesh Santhanagopalan Graduate Student Department Of CSE.
Integrated Performance Views in Charm++: Projections meets TAU Scott Biersdorff Allen D. Malony Department Computer and Information Science University.
Programming Models & Runtime Systems Breakout Report MICS PI Meeting, June 27, 2002.
Performance Technology for Complex Parallel Systems Part 2 – Complexity Scenarios Sameer Shende.
SciDAC All Hands Meeting, March 2-3, 2005 Northwestern University PIs:Alok Choudhary, Wei-keng Liao Graduate Students:Avery Ching, Kenin Coloma, Jianwei.
Dynamic performance measurement control Dynamic event grouping Multiple configurable counters Selective instrumentation Application-Level Performance Access.
Debugging parallel programs. Breakpoint debugging Probably the most widely familiar method of debugging programs is breakpoint debugging. In this method,
Allen D. Malony, Sameer S. Shende, Alan Morris, Robert Bell, Kevin Huck, Nick Trebon, Suravee Suthikulpanit, Kai Li, Li Li
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | SCHOOL OF COMPUTER SCIENCE | GEORGIA INSTITUTE OF TECHNOLOGY MANIFOLD Manifold Execution Model and System.
Allen D. Malony, Sameer Shende, Li Li, Kevin Huck Department of Computer and Information Science Performance.
Shangkar Mayanglambam, Allen D. Malony, Matthew J. Sottile Computer and Information Science Department Performance.
Integrated Performance Views in Charm++: Projections meets TAU Scott Biersdorff Allen D. Malony Department Computer and Information Science University.
Parallel Performance Measurement of Heterogeneous Parallel Systems with GPUs Allen D. Malony, Scott Biersdorff, Sameer Shende, Heike Jagode†, Stanimire.
Virtual Lab AMsterdam VLAMsterdam Abstract Machine Toolbox A.S.Z. Belloum, Z.W. Hendrikse, E.C. Kaletas, H. Afsarmanesh and L.O. Hertzberger Computer Architecture.
SDM Center Parallel I/O Storage Efficient Access Team.
Performane Analyzer Performance Analysis and Visualization of Large-Scale Uintah Simulations Kai Li, Allen D. Malony, Sameer Shende, Robert Bell Performance.
Performance Technology for Complex Parallel Systems Part 1 – Overview and TAU Introduction Allen D. Malony.
Operating Systems Unit 2: – Process Context switch Interrupt Interprocess communication – Thread Thread models Operating Systems.
Performance Technology for Complex Parallel Systems Part 1 – Overview and TAU Introduction Allen D. Malony.
Allen D. Malony Department of Computer and Information Science Computational Science Institute University of Oregon Integrating Performance.
Online Performance Analysis and Visualization of Large-Scale Parallel Applications Kai Li, Allen D. Malony, Sameer Shende, Robert Bell Performance Research.
Parallel OpenFOAM CFD Performance Studies Student: Adi Farshteindiker Advisors: Dr. Guy Tel-Zur,Prof. Shlomi Dolev The Department of Computer Science Faculty.
Performance Tool Integration in Programming Environments for GPU Acceleration: Experiences with TAU and HMPP Allen D. Malony1,2, Shangkar Mayanglambam1.
Kai Li, Allen D. Malony, Sameer Shende, Robert Bell
Productive Performance Tools for Heterogeneous Parallel Computing
Performance Technology for Scalable Parallel Systems
Self Healing and Dynamic Construction Framework:
Tracing and Performance Analysis Tools for Heterogeneous Multicore System by Soon Thean Siew.
TAU integration with Score-P
Parallel Objects: Virtualization & In-Process Components
Allen D. Malony, Sameer Shende
TAU Parallel Performance System
Many-core Software Development Platforms
TAU: A Framework for Parallel Performance Analysis
Allen D. Malony Computer & Information Science Department
Outline Introduction Motivation for performance mapping SEAA model
Performance Technology for Complex Parallel and Distributed Systems
Parallel Program Analysis Framework for the DOE ACTS Toolkit
TAU Performance DataBase Framework (PerfDBF)
Presentation transcript:

The TAU Performance System: Advances in Performance Mapping Sameer Shende University of Oregon

Outline  Introduction  Motivation for performance mapping  SEAA model  Examples:  POOMA II  Uintah  Conclusions

Motivation  Complexity  Layered software  Multi-level instrumentation  Entities not directly in source  Mapping  User-level abstractions

Hypothetical Mapping Example Engine  Particles distributed on surfaces of a cube Work packets

Hypothetical Mapping Example Source Particle* P[MAX]; /* Array of particles */ int GenerateParticles() { /* distribute particles over all faces of the cube */ for (int face=0, last=0; face < 6; face++){ /* particles on this face */ int particles_on_this_face = num(face); for (int i=last; i < particles_on_this_face; i++) { /* particle properties are a function of face */ P[i] =... f(face);... } last+= particles_on_this_face; }

Hypothetical Mapping Example (continued)  How much time is spent processing face i particles?  What is the distribution of performance among faces? int ProcessParticle(Particle *p) { /* perform some computation on p */ } int main() { GenerateParticles(); /* create a list of particles */ for (int i = 0; i < N; i++) /* iterates over the list */ ProcessParticle(P[i]); }

No Performance Mapping versus Mapping  Typical performance tools report performance with respect to routines  Do not provide support for mapping  Performance tools with SEAA mapping can observe performance with respect to scientist’s programming and problem abstractions without mappingwith mapping

Semantic Entities/Attributes/Associations  New dynamic mapping scheme - SEAA  Entities defined at any level of abstraction  Attribute entity with semantic information  Entity-to-entity associations  Two association types:  Embedded – extends data structure of associated object to store performance measurement entity  External – creates an external look-up table using address of object as the key to locate performance measurement entity

Tuning and Analysis Utilities (TAU)  Performance system framework for scalable parallel and distributed high-performance computing  General complex system computation model  nodes / contexts / threads  Multi-level: system / software / parallelism  Measurement and analysis abstraction  Integrated toolkit for performance instrumentation, measurement, analysis, and visualization  Portable performance profiling/tracing facility

TAU Performance System Architecture

Multi-Level Instrumentation in TAU  Uses multiple instrumentation interfaces  Shares information: cooperation between interfaces  Targets a common performance model  Taps information at multiple levels  source (manual annotation)  preprocessor (PDT, OPARI/OpenMP)  compiler (instrumentation-aware compilation)  library (MPI wrapper library)  runtime (DyninstAPI[U.Wisc, U.Maryland])  virtual machine (JVMPI [Sun])

Program Database Toolkit (PDT)

Performance Mapping in TAU  Supports both embedded and external associations: Embedded association External association Data (object) Timer Performance Data... Hash Table

TAU Mapping API  Source-Level API  TAU_MAPPING(statement, key); TAU_MAPPING_OBJECT(funcIdVar); TAU_MAPPING_LINK(funcIdVar, key);  TAU_MAPPING_PROFILE (funcIdVar); TAU_MAPPING_PROFILE_TIMER(timer, funcIdVar); TAU_MAPPING_PROFILE_START(timer); TAU_MAPPING_PROFILE_STOP(timer);

Mapping in POOMA II  POOMA [LANL] is a C++ framework for Computational Physics  Provides high-level abstractions:  Fields (Arrays), Particles, FFT, etc.  Encapsulates details of parallelism, data- distribution  Uses custom-computation kernels for efficient expression evaluation [PETE]  Uses vertical-execution of array statements to re-use cache [SMARTS]

POOMA II Array Example  Multi- dimensional array statements  A=B+C+D;

POOMA, PETE and SMARTS

Using Synchronous Timers

Form of Expression Templates in POOMA

Mapping Problem  One-to-many upward mapping  Traditional methods of mapping (ammortization/aggregation) lack resolution and accuracy! Template <class LHS, class RHS, class Op, class EvalTag> void ExpressionKernel<LHS,RHS,Op, EvalTag>::run() {/* iterate execution */ } A=1.0; B=2.0; … A= B+C+D; C=E-A+2.0*D;...

POOMA II Mappings  Each work packet belongs to an ExpressionKernel object  Each statement’s form associated with timer in the constructor of ExpressionKernel  ExpressionKernel class extended with embedded timer  Timing calls and entry and exit of run() method start and stop per object timer

Results of TAU Mappings  Per-statement profile!

POOMA Traces  Helps bridge the semantic-gap!

Uintah  U. of Utah, C-SAFE ASCI Level 1 Center  Component-based framework for modeling and simulation of the interactions between hydrocarbon fires and high-energy explosives and propellants [Uintah]  Work-packets belong to a higher-level task that a scientist understands  e.g., “interpolate particles to grid”

Without Mapping

Using External Associations  When task is created, a timer is created with the same name  Two level mappings:  Level 1:  Level 2:

Using Task Mappings

Tracing Uintah Execution

Two-Level Mappings: Tasks+Patch

Conclusions  New performance mapping model (SEAA)  Application of SEAA to:  asynchronously executed work packets in POOMA  packet-task-patch mapping in Uintah  Mapping performance data helps bridge the gap in understanding performance data  Complex mapping problems  cross-context mapping

Information  TAU (  PDT (  Tutorial at SC’01: M11 B. Mohr, A. Malony, S. Shende, “Performance Technology for Complex Parallel Systems” Nov. 7, 2001, Denver, CO.  LANL, NIC Booth, SC’01.

Support Acknowledgement  TAU and PDT support:  Department of Engergy (DOE)  DOE 2000 ACTS contract  DOE MICS contract  DOE ASCI Level 3 (LANL, LLNL)  DARPA  NSF National Young Investigator (NYI) award