Department of Computer and Information Science The State of TAU Allen D. Malony Department of Computer and Information Science University of Oregon

Outline Part 1 – Motivation and Pontification Performance productivity and engineering Role of structure, performance knowledge, intelligence Model-based performance engineering and expectations Part 2 – Some Present Work High-level language integration (Charm++) Heterogeneous parallel performance (GPU accelerators) Part 3 – Current/Future Projects POINT PRIMA MOJO State of TAU?

Parallel Performance Engineering and Productivity Scalable, optimized applications deliver HPC promise Optimization through performance engineering process Understand performance complexity and inefficiencies Tune application to run optimally at scale How to make the process more effective and productive? What performance technology should be used? Performance technology part of larger environment Programmability, reusability, portability, robustness Application development and optimization productivity Process, performance technology, and use will change as parallel systems evolve and technology drivers change Goal is to deliver effective performance with high productivity value now and in the future

Performance Technology and Scale What does it mean for performance technology to scale? Instrumentation, measurement, analysis, tuning Scaling complicates observation and analysis Performance data size and value: standard approaches deliver a lot of data with little value Measurement overhead, intrusion, perturbation, noise: measurement overhead → intrusion → perturbation tradeoff with analysis accuracy; “noise” in the system distorts performance Analysis complexity increases: more events, larger data, more parallelism Traditional empirical process breaks down at extremes

Role of Structure in Performance Understanding Dealing with increased complexity of the problem (measurement and analysis) should not be the only focus There is fundamental structure that underlies application execution Performance is a consequence of these forms and behaviors It is the relationship between performance data and structure that matters This need not be complex (not as complex as it seems) Shigeo Fukuda, 1987 Lunch with a Helmet On

Role of Intelligence, Automation, and Knowledge Scale forces the process to become more intelligent Even with intelligent and application-specific tools, the decision of what to analyze is difficult What will enhance productive application development? Process/tools must be more application aware What are the important events and performance metrics? Tied to application structure and computational model Tied to application domain and algorithms More automation and knowledge-based decision making Automate performance data analysis / mining / learning Include predictive features and experiment refinement Knowledge-driven adaptation and optimization guidance

Usability, Integration, and Performance Tool Design What are usable performance tools? Support the performance problem solving process Does not make overall environment less productive Usable should not mean to make simple Usability should not be at the cost of functionality Robust performance technology needs integration Integration in performance system does not imply Performance system is monolithic Performance system is a “Swiss army knife” Deconstruction or refactoring Should lead to opportunities for integration

Whole Performance Evaluation Extreme scale performance is an optimized orchestration Application, processor, memory, network, I/O Reductionist approaches to performance will be unable to support optimization and productivity objectives Application-level only performance view is myopic Interplay of hardware, software, and system components Ultimately determines how performance is delivered Performance should be evaluated in toto Application and system components Understand effects of performance interactions Identify opportunities for optimization across levels Need whole performance evaluation practice

Parallel Performance Diagnosis (circa 1991)

Model-based Performance Engineering Performance engineering tools and practice must incorporate a performance knowledge discovery process Focus on and automate performance problem identification and use to guide tuning decisions Model-oriented knowledge Computational semantics of the application Symbolic models for algorithms Performance models for system architectures / components Use to define performance expectations Higher-level abstraction for performance investigation Application developers can be more directly involved in the performance engineering process

Performance Expectations (à la Vetter) Context for understanding the performance behavior of the application and system Traditional performance implicitly requires an expectation Users are forced to reason from the perspective of absolute performance for every performance experiment and every application operation Peak measures provide an absolute upper bound Empirical data alone does not lead to performance understanding and is insufficient for optimization Empirical measurements lack a context for determining whether the operations under consideration are performing well or performing poorly

Sources of Performance Expectations Computation Model – Operational semantics of a parallel application specified by its algorithms and structure define a space of relevant performance behavior Symbolic Performance Model – Symbolic or analytical model provides the relevant parameters for the performance model, which generate expectations Historical Performance Data – Provides real execution history and empirical data on similar platforms Relative Performance Data – Used to compare similar application operations across architectural components Architectural parameters – Based on what is known of processor, memory, and communications performance

Model Use Performance models derived from application knowledge, performance experiments, and symbolic analysis They will be used to predict the performance of individual system components, such as communication and computation, as well as the application as a whole. The models can then be compared to empirical measurements—manually or automatically—to pin point or dynamically rectify performance problems. Further, these models can be evaluated at runtime in order to reduce perturbation and data management burdens, and to dynamically reconfigure system software.

Call for “Extreme” Performance Engineering Strategy to respond to technology changes and disruptions Strategy to carry forward performance expertise and knowledge Built on robust, integrated performance measurement infrastructure Model-oriented with knowledge-based reasoning Community-driven knowledge engineering Automated data / decision analytics Not just for extreme systems

State of TAU (Now) Large-scale, robust performance measurement and analysis Robust and mature Broad use in NSF, DOE, DoD Performance database TAU PerfDMF PERI DB reference platform Performance data mining TAU PerfExplorer multi-experiment data mining analysis scripting, inference

State of TAU (Now) Scalable OS/RTS tools (DOE FastOS) Scalable performance monitoring TAUoverMRNet Integrated system/application measurement KTAU

State of TAU (Now) SciDAC TASCS: Computational Quality of Service (CQoS), configuration, adaptation SciDAC FACETS: load balance modeling, automated performance testing

High-level Language Integration High-level parallel paradigms improve productivity Rich abstractions for application development Hide low-level coding and computation complexities Natural tension between powerful development environments and ability to achieve high performance General dogma: the further the application is removed from the raw machine, the more susceptible it is to performance inefficiencies, and performance problems and their sources become harder to observe and to understand Dual goals of productivity and performance require performance tool integration and language knowledge

Charm++ Approach Parallel object-oriented programming based on C++ Programs decomposed into set of parallel communicating objects (chares) Runtime system maps chares onto parallel processes/threads

Charm++ Approach (continued) Object entry method invocation triggers computation entry method message for remote process queued messages scheduled by Charm++ runtime scheduler entry methods executed to completion may call new entry methods and other routines

Charm++ Performance Events Several points in runtime system to observe events Make performance measurements (performance events) Obtain information on execution context Charm++ events: Start of an entry method End of an entry method Sending a message to another object Change in scheduler state: active to idle, idle to active Observation of multiple events at different levels of abstraction is needed to get a full performance view: logical execution model, runtime object interaction, resource-oriented state transitions

Charm++ Performance Framework How a parallel language system operationalizes events is critical to building an effective performance framework Charm++ implements performance callbacks Runtime system calls performance module at events Any registered performance module (client) is invoked Event ID and default performance data forwarded Clients can access Charm++ internal runtime routines Performance framework exposes set of key runtime events as a base C++ class Performance modules inherit and implement methods Listen only to events of interest Framework calls performance client initialization

Charm++ Performance Framework and Modules Framework allows for separation of concerns Event visibility Event measurement Allows measurement extension and customization New modules may introduce new observation requirements Our TAU module inherits the common trace interface, as do Projections and Summary, building on the TAU Profiler API

NAMD Performance Study Demonstrate integrated analysis in real application NAMD parallel molecular dynamics code Compute interactions between atoms Group atoms in patches Hybrid decomposition Distribute patches to processors Create compute objects to handle interactions between atoms of different patches Performance strategy Distribute computational workload evenly Keep communication to a minimum Several factors: model complexity, size, balancing cost

NAMD STMV Performance (profile chart showing Main and Idle events)

NAMD STMV – Comparative Profile Analysis

NAMD Performance Data Mining Use TAU PerfExplorer data mining tool Dimensionality reduction, clustering, correlation Single profiles and across multiple experiments PmeXPencil PmeZPencil PmeYPencil

Heterogeneous Parallel Systems What does it mean to be heterogeneous? Diverse in character or content (New Oxford American Dictionary) Not homogeneous (Anonymous) Heterogeneous (computing) technology more accessible Multicore processors Manycore accelerators (e.g., NVIDIA Tesla GPU) High-performance processing engines (e.g., IBM Cell BE) Performance is the main driving concern Heterogeneity is arguably the only path to extreme scale Heterogeneous (hybrid) software technology required Greater performance enables more powerful software Will give rise to more sophisticated software environments

Implications for Performance Tools Tools should support parallel computation models Current status quo is comfortable Mostly homogeneous parallel systems and software Shared-memory multithreading – OpenMP Distributed-memory message passing – MPI Parallel computational models are relatively stable (simple) Corresponding performance models are relatively tractable Parallel performance tools are just keeping up Heterogeneity creates richer computational potential Results in greater performance diversity and complexity Performance tools have to support richer computation models and broader (less constrained) performance perspectives

Performance Views of External Execution Heterogeneous applications have concurrent execution Main “host” path and “external” paths Want to capture performance for all execution paths External execution may be difficult to measure What perspective does the host have of the external entity? Determines the semantics of the measurement data Host regards external execution as a task Tasks operate concurrently with respect to the host Requires support for tracking asynchronous execution Host-side measurements of task events Performance data received from external task (limited) May depend on host for performance data I/O

GPU Computing and CUDA Programming GPUs used as accelerators for parallel programs CUDA programming environment Programming of multiple streams and GPU devices (multiple streams execute concurrently) Programming of data transfers to/from GPU device Programming of GPU kernel code Asynchronous operation, synchronization with streams CUDA Profiler Provides extensive stream-level measurements Creates post-mortem event trace of kernel operation on streams Difficult to merge with application performance data

CPU – GPU Execution Scenarios Synchronous Asynchronous

TAU CUDA Performance Measurement (TAUcuda) Build on CUDA stream event interface Allow “events” to be placed in streams and processed events are timestamped CUDA runtime reports GPU timing in event structure Events are reported back to CPU when requested use begin and end events to calculate intervals Want to associate TAU event context with CUDA events Get top of TAU event stack at begin (TAU context) CUDA kernel invocations are asynchronous CPU does not see actual CUDA “end” event CPU retrieves events in a non-blocking and blocking manner Want to capture “waiting time”

CPU-GPU Operation and TAUcuda Events

TAU CUDA Measurement API
void tau_cuda_init(int argc, char **argv);
  To be called when the application starts; initializes data structures and checks GPU status
void tau_cuda_exit();
  To be called before any thread exits at the end of the application; outputs the CUDA profile data for each thread of execution
void* tau_cuda_stream_begin(char *event, cudaStream_t stream);
  Called before the CUDA statements to be measured; returns a handle which should be used in the end call; if the event is new, or the TAU context is new for the event, a new CUDA event profile object is created
void tau_cuda_stream_end(void *handle);
  Called immediately after the CUDA statements to be measured; the handle identifies the stream; inserts a CUDA event into the stream

TAU CUDA Measurement API (2)
vector<Event> tau_cuda_update();
  Checks for completed CUDA events on all streams; non-blocking, returns the number completed on each stream
int tau_cuda_update(cudaStream_t stream);
  Same as tau_cuda_update(), but for a particular stream; non-blocking, returns the number completed on that stream
vector<Event> tau_cuda_finalize();
  Waits for all CUDA events to complete on all streams; blocking, returns the number completed on each stream
int tau_cuda_finalize(cudaStream_t stream);
  Same as tau_cuda_finalize(), but for a particular stream; blocking, returns the number completed on that stream

Scenario Results – One and Two Streams Run simple CUDA experiments to validate TAU CUDA Tesla S1070 test system

Case Study: TAUcuda in NAMD TAU integrated in Charm++ Charm++ applications: NAMD, a parallel molecular dynamics application, and the Parallel Framework for Unstructured Meshing (ParFUM) Both have been accelerated with CUDA Demonstrate use of TAUcuda Observe the effect of CUDA acceleration Show scaling results for GPU cluster execution Experimental environments Two S1070 GPU servers (University of Oregon) AC cluster: 32 nodes, 4 Tesla GPUs per node (UIUC)

NAMD GPU Profile (Two GPU Devices) Test out TAU CUDA with NAMD Two processes with one Tesla GPU for each CPU profile GPU profile (P0) GPU profile (P1)

NAMD GPU Efficiency Gain (16 versus 32 GPUs) AC cluster: 16 and 32 processes dev_sum_forces: 50% improvement dev_nonbonded: 100% improvement (table columns: Event, TAU Context, Device, Stream)

Integration of HMPP and TAU / TAUcuda HMPP Workbench High-level system for programming multi-core and GPU-accelerated systems Portable and retargetable TAU Performance System® Parallel performance instrumentation, measurement, and analysis system Target scalable parallel systems TAUcuda CUDA performance measurement Integrated with TAU HMPP–TAU = HMPP + TAU + TAUcuda

Case Study: HMPP-TAU Measurement stack (diagram): User Application → HMPP Runtime → CUDA, with an HMPP CUDA codelet; TAU measures user events, HMPP events, and codelet events; TAUcuda measures CUDA stream events and waiting information

Data/Execution Overlap (Full Environment)

High-Productivity Performance Engineering (POINT) Testbed Apps ENZO NAMD NEMO3D

Model Oriented Global Optimization (MOGO) Empirical performance data evaluated with respect to performance expectations at various levels of abstraction

Performance Refactoring (PRIMA) (UO, Juelich) Integration of instrumentation and measurement Core infrastructure Focus on TAU and Scalasca University of Oregon, Research Centre Juelich Refactor instrumentation, measurement, and analysis Build next-generation tools on new common foundation Extend to involve the SILC project Juelich, TU Dresden, TU Munich

What do we need? Performance knowledge engineering At all levels Support the building of models Represented in form the tools can reason about Understanding of “performance” interactions Between integrated components Control and data interactions Properties of concurrency and dependencies With respect to scientific problem formulation Translate to performance requirements More robust tools that are being used more broadly

Conclusion Performance problem solving (process, technology) Evolve process and technology to meet scaling challenges Closer integration with application development and execution environment Raise the level of performance problem analysis Scalable performance monitoring Performance data mining and expert analysis Whole performance evaluation Model-based performance engineering Performance expectations Support creation of robust performance technology that is available on all HEC system now and in the future

Support Acknowledgements Department of Energy (DOE) Office of Science MICS, Argonne National Lab ASC/NNSA University of Utah ASC/NNSA Level 1 ASC/NNSA, Lawrence Livermore National Lab Department of Defense (DoD) HPC Modernization Office (HPCMO) NSF Software Development for Cyberinfrastructure (SDCI) Research Centre Juelich Los Alamos National Laboratory Technical University Dresden ParaTools, Inc.

PerfSuite and TAU Integration

Performance Data Mining / Decision Analytics Should not just redescribe the performance results Should explain performance phenomena What are the causes for the performance observed? What are the factors and how do they interrelate? Performance analytics, forensics, and decision support Need to add knowledge to do more intelligent things Automated analysis needs good informed feedback: iterative tuning, performance regression testing Performance model generation requires interpretation We need better methods and tools for integrating meta-information and knowledge-based performance problem solving

Multiple Sequence Alignment (MSA) Compare protein sequences with unknown function to sequences with known function Progressive alignment: widely used heuristic ClustalW Compute a pairwise distance matrix (90% of the time is spent here) Construct a guide tree Progressive alignment along the tree OpenMP parallelism did not scale well Identify relations between factors Photo © NASA

MSA OpenMP Load Imbalance

#pragma omp for
for (m = first; m <= last; m++) {        /* outer loop */
    for (n = m + 1; n <= last; n++) {    /* inner loop */
        ...
    }
}

The triangular inner loop means early outer iterations do far more work than later ones.

Analysis Workflow and Inference Rules For each instrumented region: compute mean and stddev across all threads; compute and assert the stddev/mean ratio; correlate the region against all other regions and assert the correlation; assert the “severity” of the event (exclusive time as a percentage of total runtime)
Rule 1: IF severity(r) > 0.05 AND ratio(r) > 0.25 THEN alert(“load imbalance: r”) AND assert imbalanced(r)
Rule 2: IF imbalanced(r1) AND imbalanced(r2) AND calls(r1,r2) AND correlation(r1,r2) < -0.5 THEN alert(“new schedule suggested: r1, r2”)

Example PerfExplorer Output

--------------- PerfExplorer test script start ------------
--- Looking for load imbalances ---
Loading Rules...
Reading rules: openuh/OpenUHRules.drl... done.
Loading the data...
Main Event: main
Firing rules...
The event LOOP #3 [file:/mnt/netapp/home1/khuck/openuh/src/fpga/msap.c <63, 163>] has a high load imbalance for metric P_WALL_CLOCK_TIME
  Mean/Stddev ratio: 0.667, Stddev actual: 6636425.1875
  Percentage of total runtime: 27.15%
The event LOOP #2 [file:/mnt/netapp/home1/khuck/openuh/src/fpga/msap.c <65, 158>] has a high load imbalance for metric P_WALL_CLOCK_TIME
  Mean/Stddev ratio: 0.260, Stddev actual: 1.74530281875E7
  Percentage of total runtime: 71.40%
LOOP #3 [file:/mnt/netapp/home1/khuck/openuh/src/fpga/msap.c <63, 163>] calls LOOP #2 [file:/mnt/netapp/home1/khuck/openuh/src/fpga/msap.c <65, 158>], and they are both showing signs of load imbalance. If these events are in an OpenMP parallel region, consider methods to balance the workload, such as dynamic instead of static work assignment.
...done with rules.
---------------- PerfExplorer test script end -------------

MSA – Improved Scaling Before: efficiency < 70% with 16 processors, 400 sequence set After: efficiency > 92.5% with 16 processors, 400 sequence set Efficiency ~= 80% with 128 processors, 1000 sequence set Scheduling parameters: #pragma omp for schedule (dynamic,1) nowait

PGI Compiler for GPUs Accelerator programming support for Fortran and C Directive-based programming Loop parallelization for acceleration on GPUs PGI 9.0 for x64-based Linux (preview release) Compiled program: CUDA target Synchronous accelerator operations Profile interface support

TAU with PGI Accelerator Compiler Compiler-based instrumentation for PGI compilers Track runtime system events as seen by host processor Wrapped runtime library Show source information associated with events Routine name File name, source line number for kernel Variable names in memory upload, download operations Grid sizes Any configuration of TAU with PGI supports tracking of accelerator operations Tested with PGI 8.0.3, 8.0.5, 8.0.6 compilers Qualification and testing with PGI 9.0-1 complete