1 Department of Computer and Information Science
The State of TAU
Allen D. Malony
Department of Computer and Information Science, University of Oregon

2 Outline
Part 1 – Motivation and Pontification
  Performance productivity and engineering
  Role of structure, performance knowledge, intelligence
  Model-based performance engineering and expectations
Part 2 – Some Present Work
  High-level language integration (Charm++)
  Heterogeneous parallel performance (GPU accelerators)
Part 3 – Current/Future Projects
  POINT, PRIMA, MOJO
State of TAU?

3 Parallel Performance Engineering and Productivity
Scalable, optimized applications deliver the HPC promise
Optimization through a performance engineering process
  Understand performance complexity and inefficiencies
  Tune application to run optimally at scale
How to make the process more effective and productive?
  What performance technology should be used?
Performance technology is part of a larger environment
  Programmability, reusability, portability, robustness
  Application development and optimization productivity
Process, performance technology, and use will change as parallel systems evolve and technology drivers change
Goal is to deliver effective performance with high productivity value, now and in the future

4 Performance Technology and Scale
What does it mean for performance technology to scale?
  Instrumentation, measurement, analysis, tuning
Scaling complicates observation and analysis
  Performance data size and value
    standard approaches deliver a lot of data with little value
  Measurement overhead, intrusion, perturbation, noise
    measurement overhead → intrusion → perturbation
    tradeoff with analysis accuracy
    "noise" in the system distorts performance
  Analysis complexity increases
    more events, larger data, more parallelism
Traditional empirical process breaks down at extremes

5 Role of Structure in Performance Understanding
Dealing with the increased complexity of the problem (measurement and analysis) should not be the only focus
There is fundamental structure that underlies application execution
  Performance is a consequence of these forms and behaviors
It is the relationship between performance data and structure that matters
This need not be complex (not as complex as it seems)
Shigeo Fukuda, 1987, Lunch with a Helmet On

7 Role of Intelligence, Automation, and Knowledge
Scale forces the process to become more intelligent
  Even with intelligent and application-specific tools, deciding what to analyze is difficult
What will enhance productive application development?
Process/tools must be more application-aware
  What are the important events and performance metrics?
  Tied to application structure and computational model
  Tied to application domain and algorithms
More automation and knowledge-based decision making
  Automate performance data analysis / mining / learning
  Include predictive features and experiment refinement
  Knowledge-driven adaptation and optimization guidance

8 Usability, Integration, and Performance Tool Design
What are usable performance tools?
  Support the performance problem solving process
  Do not make the overall environment less productive
  Usable should not mean simplistic
  Usability should not come at the cost of functionality
Robust performance technology needs integration
Integration in a performance system does not imply that
  the performance system is monolithic
  the performance system is a "Swiss army knife"
Deconstruction or refactoring should lead to opportunities for integration

9 Whole Performance Evaluation
Extreme-scale performance is an optimized orchestration
  Application, processor, memory, network, I/O
Reductionist approaches to performance will be unable to support optimization and productivity objectives
  An application-level-only performance view is myopic
Interplay of hardware, software, and system components ultimately determines how performance is delivered
Performance should be evaluated in toto
  Application and system components
  Understand effects of performance interactions
  Identify opportunities for optimization across levels
Need a whole performance evaluation practice

10 Parallel Performance Diagnosis (circa 1991)

11 Model-based Performance Engineering
Performance engineering tools and practice must incorporate a performance knowledge discovery process
  Focus on and automate performance problem identification and use it to guide tuning decisions
Model-oriented knowledge
  Computational semantics of the application
  Symbolic models for algorithms
  Performance models for system architectures / components
Use to define performance expectations
  Higher-level abstraction for performance investigation
  Application developers can be more directly involved in the performance engineering process

12 Performance Expectations (à la Vetter)
Context for understanding the performance behavior of the application and system
Traditional performance analysis implicitly requires an expectation
  Users are forced to reason from the perspective of absolute performance for every performance experiment and every application operation
  Peak measures provide an absolute upper bound
Empirical data alone does not lead to performance understanding and is insufficient for optimization
  Empirical measurements lack a context for determining whether the operations under consideration are performing well or poorly

13 Sources of Performance Expectations
Computation Model – Operational semantics of a parallel application, specified by its algorithms and structure, define a space of relevant performance behavior
Symbolic Performance Model – A symbolic or analytical model provides the relevant parameters for the performance model, which generate expectations
Historical Performance Data – Provides real execution history and empirical data on similar platforms
Relative Performance Data – Used to compare similar application operations across architectural components
Architectural Parameters – Based on what is known of processor, memory, and communication performance

14 Model Use
Performance models are derived from application knowledge, performance experiments, and symbolic analysis
They will be used to predict the performance of individual system components, such as communication and computation, as well as the application as a whole
The models can then be compared to empirical measurements, manually or automatically, to pinpoint or dynamically rectify performance problems
Further, these models can be evaluated at runtime in order to reduce perturbation and data management burdens, and to dynamically reconfigure system software

15 Call for "Extreme" Performance Engineering
Strategy to respond to technology changes and disruptions
Strategy to carry forward performance expertise and knowledge
Built on a robust, integrated performance measurement infrastructure
Model-oriented with knowledge-based reasoning
Community-driven knowledge engineering
Automated data / decision analytics
Not just for extreme systems

16 State of TAU (Now)
Large-scale, robust performance measurement and analysis
  Robust and mature; broad use in NSF, DOE, DoD
Performance database
  TAU PerfDMF
  PERI DB reference platform
Performance data mining
  TAU PerfExplorer
  multi-experiment data mining analysis
  scripting, inference

17 State of TAU (Now)
Scalable OS/RTS tools (DOE FastOS)
Scalable performance monitoring
  TAUoverMRNet
Integrated system/application measurement
  KTAU

18 State of TAU (Now)
SciDAC TASCS
  Computational Quality of Service (CQoS)
  Configuration, adaptation
SciDAC FACETS
  Load balance modeling
  Automated performance testing

19 High-level Language Integration
High-level parallel paradigms improve productivity
  Rich abstractions for application development
  Hide low-level coding and computation complexities
Natural tension between powerful development environments and the ability to achieve high performance
General dogma
  The further the application is removed from the raw machine, the more susceptible it is to performance inefficiencies
  Performance problems and their sources become harder to observe and to understand
Dual goals of productivity and performance require performance tool integration and language knowledge

20 Charm++ Approach
Parallel object-oriented programming based on C++
Programs decomposed into a set of parallel communicating objects (chares)
Runtime system maps chares onto parallel processes/threads

21 Charm++ Approach (continued)
Object entry method invocation triggers computation
  entry method message for remote process
  queued messages scheduled by Charm++ runtime scheduler
  entry methods executed to completion
  may call new entry methods and other routines
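As a minimal sketch of this model (the class, method, and proxy names here are illustrative, not from the talk), a chare and an asynchronous entry method invocation might look like:

  // Hypothetical Charm++ chare; CBase_Worker is generated from a .ci
  // interface file declaring: chare Worker { entry Worker(); entry void doWork(int n); };
  class Worker : public CBase_Worker {
   public:
    Worker() {}
    void doWork(int n) {   // entry method: runs to completion once scheduled
      // ... compute on local data ...
      // may asynchronously invoke entry methods on other chares via their proxies
    }
  };

  // Elsewhere: workerProxy.doWork(42); enqueues a message that the
  // Charm++ runtime scheduler later dispatches to doWork().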

22 Charm++ Performance Events
Several points in the runtime system to observe events
  Make performance measurements (performance events)
  Obtain information on execution context
Charm++ events
  Start of an entry method
  End of an entry method
  Sending a message to another object
  Change in scheduler state: active to idle, idle to active
Observation of multiple events at different levels of abstraction is needed to get a full performance view
  logical execution model
  runtime object interaction
  resource-oriented state transitions

23 Charm++ Performance Framework
How a parallel language system operationalizes events is critical to building an effective performance framework
Charm++ implements performance callbacks
  Runtime system calls the performance module at events
  Any registered performance module (client) is invoked
  Event ID and default performance data are forwarded
  Clients can access Charm++ internal runtime routines
Performance framework exposes a set of key runtime events as a base C++ class
  Performance modules inherit and implement methods
  Listen only to events of interest
  Framework calls performance client initialization
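A hedged sketch of this callback pattern (the base class and method names below are illustrative, not the actual Charm++/TAU interface):

  // Hypothetical event base class exposed by the framework; default
  // implementations do nothing, so a client listens only to events of interest.
  class PerfClient {
   public:
    virtual void beginEntryMethod(int eventId) {}
    virtual void endEntryMethod(int eventId) {}
    virtual void messageSent(int eventId, int bytes) {}
    virtual void schedulerIdle() {}
    virtual ~PerfClient() {}
  };

  class TauClient : public PerfClient {
    void beginEntryMethod(int eventId) override { /* start TAU timer for eventId */ }
    void endEntryMethod(int eventId) override   { /* stop TAU timer */ }
  };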

24 Charm++ Performance Framework and Modules
Framework allows for separation of concerns
  Event visibility
  Event measurement
Allows measurement extension and customization
  New modules may introduce new observation requirements
Our TAU module inherits the common trace interface, as do Projections and Summary
TAU Profiler API

25 NAMD Performance Study
Demonstrate integrated analysis in a real application
NAMD parallel molecular dynamics code
  Compute interactions between atoms
  Group atoms in patches
Hybrid decomposition
  Distribute patches to processors
  Create compute objects to handle interactions between atoms of different patches
Performance strategy
  Distribute computational workload evenly
  Keep communication to a minimum
  Several factors: model complexity, size, balancing cost

26 NAMD STMV Performance
Figure: profile highlighting the Main and Idle events

27 NAMD STMV – Comparative Profile Analysis

28 NAMD Performance Data Mining
Use the TAU PerfExplorer data mining tool
  Dimensionality reduction, clustering, correlation
  Single profiles and across multiple experiments
PmeXPencil, PmeZPencil, PmeYPencil

29 Heterogeneous Parallel Systems
What does it mean to be heterogeneous?
  Diverse in character or content (New Oxford American Dictionary)
  Not homogeneous (Anonymous)
Heterogeneous (computing) technology is more accessible
  Multicore processors
  Manycore accelerators (e.g., NVIDIA Tesla GPU)
  High-performance processing engines (e.g., IBM Cell BE)
Performance is the main driving concern
  Heterogeneity is arguably the only path to extreme scale
Heterogeneous (hybrid) software technology required
  Greater performance enables more powerful software
  Will give rise to more sophisticated software environments

30 Implications for Performance Tools
Tools should support parallel computation models
Current status quo is comfortable
  Mostly homogeneous parallel systems and software
    Shared-memory multithreading – OpenMP
    Distributed-memory message passing – MPI
  Parallel computational models are relatively stable (simple)
  Corresponding performance models are relatively tractable
  Parallel performance tools are just keeping up
Heterogeneity creates richer computational potential
  Results in greater performance diversity and complexity
Performance tools have to support richer computation models and broader (less constrained) performance perspectives

31 Performance Views of External Execution
Heterogeneous applications have concurrent execution
  Main "host" path and "external" paths
  Want to capture performance for all execution paths
External execution may be difficult to measure
What perspective does the host have of the external entity?
  Determines the semantics of the measurement data
Host regards external execution as a task
  Tasks operate concurrently with respect to the host
  Requires support for tracking asynchronous execution
  Host-side measurements of task events
  Performance data received from the external task (limited)
  May depend on host for performance data I/O

32 GPU Computing and CUDA Programming
GPUs used as accelerators for parallel programs
CUDA programming environment
  Programming of multiple streams and GPU devices
    multiple streams execute concurrently
  Programming of data transfers to/from the GPU device
  Programming of GPU kernel code
  Asynchronous operation, synchronization with streams
CUDA Profiler
  Provides extensive stream-level measurements
  creates a post-mortem event trace of kernel operation on streams
  difficult to merge with application performance data
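For reference, a minimal CUDA sketch of the asynchronous, multi-stream pattern described above (the buffer names, sizes, and kernel are illustrative; assume h_in/h_out are page-locked host buffers, e.g., from cudaMallocHost, and d_in/d_out are device buffers):

  cudaStream_t s[2];
  for (int i = 0; i < 2; i++) cudaStreamCreate(&s[i]);
  for (int i = 0; i < 2; i++) {
      // asynchronous host-to-device copy queued on stream i
      cudaMemcpyAsync(d_in[i], h_in[i], bytes, cudaMemcpyHostToDevice, s[i]);
      kernel<<<grid, block, 0, s[i]>>>(d_in[i], d_out[i]);  // kernel queued on stream i
      // asynchronous device-to-host copy of the result
      cudaMemcpyAsync(h_out[i], d_out[i], bytes, cudaMemcpyDeviceToHost, s[i]);
  }
  cudaDeviceSynchronize();  // block until both streams drain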

33 CPU – GPU Execution Scenarios
Figures: synchronous and asynchronous execution scenarios

34 TAU CUDA Performance Measurement (TAUcuda)
Build on the CUDA stream event interface
  Allows "events" to be placed in streams and processed
    events are timestamped
    CUDA runtime reports GPU timing in the event structure
  Events are reported back to the CPU when requested
    use begin and end events to calculate intervals
Want to associate TAU event context with CUDA events
  Get top of TAU event stack at begin (TAU context)
CUDA kernel invocations are asynchronous
  CPU does not see the actual CUDA "end" event
  CPU retrieves events in a non-blocking or blocking manner
Want to capture "waiting time"
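The CUDA event mechanism this builds on looks roughly like the following (a sketch of the underlying API, not TAUcuda's actual code; the kernel and its arguments are illustrative):

  cudaEvent_t begin, end;
  cudaEventCreate(&begin);
  cudaEventCreate(&end);
  cudaEventRecord(begin, stream);            // timestamped by the GPU when reached
  kernel<<<grid, block, 0, stream>>>(args);  // asynchronous work being measured
  cudaEventRecord(end, stream);

  // non-blocking check: has the end event completed yet?
  if (cudaEventQuery(end) == cudaSuccess) { /* harvest the interval now */ }

  cudaEventSynchronize(end);                 // blocking retrieval ("waiting time")
  float ms;
  cudaEventElapsedTime(&ms, begin, end);     // GPU-side interval in milliseconds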

35 CPU-GPU Operation and TAUcuda Events

36 TAU CUDA Measurement API
void tau_cuda_init(int argc, char **argv);
  Called when the application starts
  Initializes data structures and checks GPU status
void tau_cuda_exit();
  Called before any thread exits at the end of the application
  Outputs all CUDA profile data for each thread of execution
void* tau_cuda_stream_begin(char *event, cudaStream_t stream);
  Called before the CUDA statements to be measured
  Returns a handle which should be used in the end call
  If the event is new, or the TAU context is new for the event, a new CUDA event profile object is created
void tau_cuda_stream_end(void *handle);
  Called immediately after the CUDA statements to be measured
  Handle identifies the stream; inserts a CUDA event into the stream

37 TAU CUDA Measurement API (2)
vector<Event> tau_cuda_update();
  Checks for completed CUDA events on all streams
  Non-blocking; returns the number completed on each stream
int tau_cuda_update(cudaStream_t stream);
  Same as tau_cuda_update(), but for a particular stream
  Non-blocking; returns the number completed on the stream
vector<Event> tau_cuda_finalize();
  Waits for all CUDA events to complete on all streams
  Blocking; returns the number completed on each stream
int tau_cuda_finalize(cudaStream_t stream);
  Same as tau_cuda_finalize(), but for a particular stream
  Blocking; returns the number completed on the stream
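Putting the API together, a hedged usage sketch (the kernel, stream, and event names are illustrative, not from the talk):

  tau_cuda_init(argc, argv);                          // once, at application start
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  void *h = tau_cuda_stream_begin("vecadd", stream);  // begin event placed in stream
  vecadd<<<grid, block, 0, stream>>>(a, b, c);        // asynchronous kernel launch
  tau_cuda_stream_end(h);                             // end event placed in stream

  tau_cuda_update();     // non-blocking: harvest any completed intervals
  /* ... more host work overlapped with the GPU ... */
  tau_cuda_finalize();   // blocking: wait for all outstanding events
  tau_cuda_exit();       // write the per-thread CUDA profiles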

38 Scenario Results – One and Two Streams
Run simple CUDA experiments to validate TAU CUDA
Tesla S1070 test system

39 Case Study: TAUcuda in NAMD
TAU integrated in Charm++
Charm++ applications
  NAMD is a molecular dynamics application
  Parallel Framework for Unstructured Meshing (ParFUM)
  Both have been accelerated with CUDA
Demonstration use of TAUcuda
  Observe the effect of CUDA acceleration
  Show scaling results for GPU cluster execution
Experimental environments
  Two S1070 GPU servers (University of Oregon)
  AC cluster: 32 nodes, 4 Tesla GPUs per node (UIUC)

40 NAMD GPU Profile (Two GPU Devices)
Test out TAU CUDA with NAMD
Two processes, with one Tesla GPU for each
Figures: CPU profile; GPU profile (P0); GPU profile (P1)

41 NAMD GPU Efficiency Gain (16 versus 32 GPUs)
AC cluster: 16 and 32 processes
  dev_sum_forces: 50% improvement
  dev_nonbonded: 100% improvement
Table columns: Event, TAU Context, Device, Stream

42 Integration of HMPP and TAU / TAUcuda
HMPP Workbench
  High-level system for programming multi-core and GPU-accelerated systems
  Portable and retargetable
TAU Performance System®
  Parallel performance instrumentation, measurement, and analysis system
  Targets scalable parallel systems
TAUcuda
  CUDA performance measurement
  Integrated with TAU
HMPP–TAU = HMPP + TAU + TAUcuda

43 Case Study: HMPP-TAU
Diagram: User Application → HMPP Runtime → CUDA, with an HMPP CUDA codelet
TAU and TAUcuda measurement
  User events, HMPP events, codelet events
  CUDA stream events, waiting information

44 Data/Execution Overlap (Full Environment)

45 High-Productivity Performance Engineering (POINT)
Testbed applications: ENZO, NAMD, NEMO3D

46 Model Oriented Global Optimization (MOGO)
Empirical performance data evaluated with respect to performance expectations at various levels of abstraction

47 Performance Refactoring (PRIMA) (UO, Juelich)
Integration of instrumentation and measurement
  Core infrastructure
  Focus on TAU and Scalasca
  University of Oregon, Research Centre Juelich
Refactor instrumentation, measurement, and analysis
  Build next-generation tools on a new common foundation
Extend to involve the SILC project
  Juelich, TU Dresden, TU Munich

48 What do we need?
Performance knowledge engineering
  At all levels
  Support the building of models
  Represented in a form the tools can reason about
Understanding of "performance" interactions
  Between integrated components
  Control and data interactions
  Properties of concurrency and dependencies
With respect to scientific problem formulation
  Translate to performance requirements
More robust tools that are being used more broadly

49 Conclusion
Performance problem solving (process, technology)
  Evolve process and technology to meet scaling challenges
  Closer integration with application development and execution environment
Raise the level of performance problem analysis
  Scalable performance monitoring
  Performance data mining and expert analysis
  Whole performance evaluation
  Model-based performance engineering
  Performance expectations
Support creation of robust performance technology that is available on all HEC systems, now and in the future

50 Support Acknowledgements
Department of Energy (DOE), Office of Science
  MICS, Argonne National Lab
ASC/NNSA
  University of Utah ASC/NNSA Level 1
  ASC/NNSA, Lawrence Livermore National Lab
Department of Defense (DoD)
  HPC Modernization Office (HPCMO)
NSF Software Development for Cyberinfrastructure (SDCI)
Research Centre Juelich
Los Alamos National Laboratory
Technical University Dresden
ParaTools, Inc.

51 PerfSuite and TAU Integration

52 PerfSuite and TAU Integration

53 Performance Data Mining / Decision Analytics
Should not just redescribe the performance results
Should explain performance phenomena
  What are the causes of the performance observed?
  What are the factors, and how do they interrelate?
  Performance analytics, forensics, and decision support
Need to add knowledge to do more intelligent things
  Automated analysis needs good, informed feedback
    iterative tuning, performance regression testing
  Performance model generation requires interpretation
We need better methods and tools for
  Integrating meta-information
  Knowledge-based performance problem solving

54 Multiple Sequence Alignment (MSA)
Compare protein sequences with unknown function to sequences with known function
Progressive alignment
  Widely used heuristic: ClustalW
  Compute a pairwise distance matrix (90% of the time is spent here)
  Construct a guide tree
  Progressive alignment along the tree
OpenMP parallelism did not scale well
  Identify relations between factors

55 MSA OpenMP Load Imbalance
  #pragma omp for
  for (m = first; m <= last; m++) {      /* outer loop */
      for (n = m + 1; n <= last; n++) {  /* inner loop */
          /* ... pairwise distance computation ... */
      }
  }

56 Analysis Workflow and Inference Rules
For each instrumented region:
  compute mean, stddev across all threads
  compute, assert stddev/mean ratio
  correlate region against all other regions; assert correlation
  assert "severity" of event (severity = exclusive time percentage)
Rule 1: IF severity(r) > 0.05 AND ratio(r) > 0.25
  THEN alert("load imbalance: r") AND assert imbalanced(r)
Rule 2: IF imbalanced(r1) AND imbalanced(r2) AND calls(r1,r2) AND correlation(r1,r2) < -0.5
  THEN alert("new schedule suggested: r1, r2")
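A minimal C++ sketch of Rule 1's imbalance test (the thresholds are the ones on this slide; the struct and function names are illustrative):

  #include <cmath>
  #include <vector>

  struct Region {
      std::vector<double> exclusive;  // per-thread exclusive time
      double severity;                // fraction of total runtime (exclusive time percentage)
  };

  // Rule 1: a significant region with a high stddev/mean ratio across threads
  bool loadImbalanced(const Region &r) {
      double n = r.exclusive.size(), sum = 0.0, sq = 0.0;
      for (double t : r.exclusive) { sum += t; sq += t * t; }
      double mean = sum / n;
      double stddev = std::sqrt(sq / n - mean * mean);
      return r.severity > 0.05 && (stddev / mean) > 0.25;
  }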

57 Example PerfExplorer Output
PerfExplorer test script start
--- Looking for load imbalances ---
Loading Rules...
Reading rules: openuh/OpenUHRules.drl... done.
loading the data...
Main Event: main
Firing rules...
The event LOOP #3 [file:/mnt/netapp/home1/khuck/openuh/src/fpga/msap.c <63, 163>] has a high load imbalance for metric P_WALL_CLOCK_TIME
  Mean/Stddev ratio: 0.667, Stddev actual:
  Percentage of total runtime: 27.15%
The event LOOP #2 [file:/mnt/netapp/home1/khuck/openuh/src/fpga/msap.c <65, 158>] has a high load imbalance for metric P_WALL_CLOCK_TIME
  Mean/Stddev ratio: 0.260, Stddev actual: E7
  Percentage of total runtime: 71.40%
LOOP #3 [file:/mnt/netapp/home1/khuck/openuh/src/fpga/msap.c <63, 163>] calls LOOP #2 [file:/mnt/netapp/home1/khuck/openuh/src/fpga/msap.c <65, 158>], and they are both showing signs of load imbalance. If these events are in an OpenMP parallel region, consider methods to balance the workload, such as dynamic instead of static work assignment.
...done with rules.
PerfExplorer test script end
(Slide callouts: Rule 1 fired for each imbalanced loop; Rule 2 fired for the pair.)

58 MSA – Improved Scaling
Before: efficiency < 70% with 16 processors, 400 sequence set
After: efficiency > 92.5% with 16 processors, 400 sequence set
Efficiency ~80% with 128 processors, 1000 sequence set
Scheduling parameters:
  #pragma omp for schedule(dynamic,1) nowait

59 PGI Compiler for GPUs
Accelerator programming support
  Fortran and C
  Directive-based programming
  Loop parallelization for acceleration on GPUs
  PGI 9.0 for x64-based Linux (preview release)
Compiled program
  CUDA target
  Synchronous accelerator operations
  Profile interface support
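As an illustrative example of this directive-based style (a sketch in the spirit of the PGI Accelerator model; the exact pragma syntax and clauses varied by compiler release, and the arrays are hypothetical):

  // Hypothetical loop offload: the compiler generates a CUDA kernel for the
  // annotated region and manages the transfers of a, b, and c.
  #pragma acc region
  {
      for (int i = 0; i < n; i++)
          c[i] = a[i] + b[i];
  }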

60 TAU with PGI Accelerator Compiler
Compiler-based instrumentation for PGI compilers
  Track runtime system events as seen by the host processor
  Wrapped runtime library
Show source information associated with events
  Routine name
  File name, source line number for kernel
  Variable names in memory upload, download operations
  Grid sizes
Any configuration of TAU with PGI supports tracking of accelerator operations
Tested with PGI 8.0.3 and 8.0.5 compilers
Qualification and testing with PGI complete

