Slide 1: Recent Advances in the TAU Performance System
Allen D. Malony, Sameer Shende
{malony,shende}@cs.uoregon.edu
Department of Computer and Information Science
Computational Science Institute
University of Oregon
LLNL, September 2002

Slide 2: Outline
• Complexity and performance technology
• What is the TAU performance system?
• Problems currently being investigated
  - Instrumentation control and selection
  - Performance mapping and callpath profiling
  - Online performance analysis and visualization
  - Performance analysis for component software
  - Performance database framework
• Concluding remarks

Slide 3: Complexity in Parallel and Distributed Systems
• Complexity in computing system architecture
  - Diverse parallel and distributed system architectures
    (shared / distributed memory, cluster, hybrid, NOW, Grid, …)
  - Sophisticated processor / memory / network architectures
• Complexity in parallel software environment
  - Diverse parallel programming paradigms
  - Optimizing compilers and sophisticated runtime systems
  - Advanced numerical libraries and application frameworks
  - Hierarchical, multi-level software architectures
  - Multi-component, coupled simulation models

Slide 4: Complexity Determines Performance Requirements
• Performance observability requirements
  - Multiple levels of software and hardware
  - Different types and detail of performance data
  - Alternative performance problem solving methods
  - Multiple targets of software and system application
• Performance technology requirements
  - Broad scope of performance observation
  - Flexible and configurable mechanisms
  - Technology integration and extension
  - Cross-platform portability
  - Open, layered, and modular framework architecture

Slide 5: Complexity Challenges for Performance Tools
• Computing system environment complexity
  - Observation integration and optimization
  - Access, accuracy, and granularity constraints
  - Diverse/specialized observation capabilities/technology
  - Restricted modes limit performance problem solving
• Sophisticated software development environments
  - Programming paradigms and performance models
  - Performance data mapping to software abstractions
  - Uniformity of performance abstraction across platforms
  - Rich observation capabilities and flexible configuration
  - Common performance problem solving methods

Slide 6: General Problems (Performance Technology)
How do we create robust and ubiquitous performance technology for the analysis and tuning of parallel and distributed software and systems in the presence of (evolving) complexity challenges?
How do we apply performance technology effectively for the variety and diversity of performance problems that arise in the context of complex parallel and distributed computer systems?

Slide 7: TAU Performance System Framework
• Tuning and Analysis Utilities (aka Tools Are Us)
• Performance system framework for scalable parallel and distributed high-performance computing
• Targets a general complex system computation model
  - nodes / contexts / threads
  - Multi-level: system / software / parallelism
  - Measurement and analysis abstraction
• Integrated toolkit for performance instrumentation, measurement, analysis, and visualization
  - Portable performance profiling/tracing facility
  - Open software approach

Slide 8: TAU Performance System Architecture
[Architecture diagram; EPILOG and Paraver appear among the supported trace formats]

Slide 9: Instrumentation Control and Selection
• Selection of which performance events to observe
  - Could depend on scope, type, level of interest
  - Could depend on instrumentation overhead
• How is selection supported in the instrumentation system?
  - No choice
  - Include / exclude lists (TAU)
  - Environment variables
  - Static vs. dynamic
• Problem: controlling instrumentation of small routines
  - High relative measurement overhead
  - Significant intrusion and possible perturbation

Slide 10: Rule-Based Overhead Analysis (N. Trebon, UO)
• Analyze the performance data to determine events with high (relative) measurement overhead
• Create a select list for excluding those events
• Rule grammar (used in the TAUreduce tool):
    [GroupName:] Field Operator Number
  - GroupName indicates the rule applies to events in that group
  - Field is an event metric attribute (from profile statistics):
    numcalls, numsubs, percent, usec, cumusec, totalcount, stdev, usecs/call, counts/call
  - Operator is one of >, <, or =
  - Number is any number
• Compound rules are possible using & between simple rules

Slide 11: Example Rules

  # Exclude all events that are members of TAU_USER
  # and use less than 1000 microseconds
  TAU_USER:usec < 1000

  # Exclude all events that use less than 1000
  # microseconds and are called only once
  usec < 1000 & numcalls = 1

  # Exclude all events that take less than 1000 usecs per
  # call OR have a (total inclusive) percent less than 5
  usecs/call < 1000
  percent < 5

• Scientific notation can be used
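
To make the rule semantics concrete, here is a minimal sketch of how a reducer might represent and apply such rules. This is not the actual tau_reduce implementation; the structures and all names below are invented for illustration.

  #include <map>
  #include <string>

  // Sketch only: per-event profile statistics keyed by the grammar's
  // attribute names (numcalls, usec, percent, usecs/call, ...).
  struct EventStats {
    std::string group;                     // e.g. "TAU_USER"
    std::map<std::string, double> metric;  // e.g. metric["usec"] = 750.0
  };

  // One simple rule: [GroupName:] Field Operator Number
  struct Rule {
    std::string group;   // empty => rule applies to every group
    std::string field;   // e.g. "usec", "numcalls", "usecs/call"
    char        op;      // '>', '<', or '='
    double      number;
  };

  // A simple rule matches an event when the group (if given) agrees
  // and the metric satisfies the comparison. A compound rule, written
  // with '&', excludes an event only if all of its simple rules match.
  bool matches(const EventStats& e, const Rule& r) {
    if (!r.group.empty() && e.group != r.group) return false;
    double v = e.metric.at(r.field);
    switch (r.op) {
      case '>': return v > r.number;
      case '<': return v < r.number;
      case '=': return v == r.number;
      default:  return false;
    }
  }

Under this sketch, the first example rule above would be represented as Rule{"TAU_USER", "usec", '<', 1000}.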

Slide 12: TAUReduce Example
• tau_reduce implements overhead reduction in TAU
• Consider the klargest example
  - Find the kth largest element in a set of N elements
  - Compare two methods: quicksort, select_kth_largest
  - Test case: i = 2324, N = 1000000 (uninstrumented)
    quicksort: (wall clock) = 0.188511 secs
    select_kth_largest: (wall clock) = 0.149594 secs
    Total: (P3/1.2GHz time) = 0.340u 0.020s 0:00.37
• Execute with all routines instrumented
• Execute with rule-based selective instrumentation:
    usec > 1000 & numcalls > 400000 & usecs/call < 25

Slide 13: Simple sorting example on one processor

Before selective instrumentation reduction:

NODE 0;CONTEXT 0;THREAD 0:
---------------------------------------------------------------------------------------
%Time    Exclusive    Inclusive       #Call      #Subrs  Inclusive Name
              msec   total msec                          usec/call
---------------------------------------------------------------------------------------
100.0           13        4,982           1           4    4982030 int main
 93.5        3,223        4,659 4.20241E+06 1.40268E+07          1 void quicksort
 62.9      0.00481        3,134           5           5     626839 int kth_largest_qs
 36.4          137        1,813          28      450057      64769 int select_kth_largest
 33.6          150        1,675      449978      449978          4 void sort_5elements
 28.8        1,435        1,435 1.02744E+07           0          0 void interchange
  0.4           20           20           1           0      20668 void setup
  0.0       0.0118       0.0118          49           0          0 int ceil

After selective instrumentation reduction:

NODE 0;CONTEXT 0;THREAD 0:
---------------------------------------------------------------------------------------
%Time    Exclusive    Inclusive       #Call      #Subrs  Inclusive Name
              msec   total msec                          usec/call
---------------------------------------------------------------------------------------
100.0           14          383           1           4     383333 int main
 50.9          195          195           5           0      39017 int kth_largest_qs
 40.0          153          153          28          79       5478 int select_kth_largest
  5.4           20           20           1           0      20611 void setup
  0.0         0.02         0.02          49           0          0 int ceil

Slide 14: Performance Mapping
• Associate performance with "significant" entities (events)
  - Source code points are important: functions, regions, control flow events, user events
  - Execution process and thread entities are important
  - Some entities are more abstract, harder to measure
• Consider callgraph (callpath) profiling
  - Measure time (metric) along an edge (path) of the callgraph
  - An incident edge gives a parent / child view
  - An edge sequence (path) gives a parent / descendant view
• Problem: callpath profiling when the callgraph is unknown
  - Determine the callgraph dynamically at runtime
  - Map performance measurements to dynamic callpath state

Slide 15: Callgraph (Callpath) Profiling

[Callgraph figure: nodes A through I]

• 0-level callpath
  - Callgraph node: A
• 1-level callpath
  - Immediate descendant: A → B, E → I, D → H
  - C → H ?
• k-level callpath (k > 1)
  - k-call descendant
  - 2-level: A → D, C → I
  - 2-level: A → I ?
  - 3-level: A → H

Slide 16: 1-Level Callpath Profiling in TAU (S. Shende, UO)
• TAU maintains a performance event (routine) callstack
• The profiled routine (child) looks in the callstack for its parent
  - The previous profiled performance event is the parent
  - A callpath profile structure is created the first time the parent calls
• TAU records the parent in a callgraph map for the child
  - A string representing the 1-level callpath is used as its key
  - "a( )=>b( )": name for time spent in "b" when called by "a"
• The map returns a pointer to the callpath profile structure
  - The 1-level callpath is profiled using this profiling data
• Builds upon TAU's performance mapping technology
  - Measurement is independent of instrumentation
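
The callpath mapping mechanics can be pictured with a short sketch. This is not TAU's actual internals; the types and names below are invented, and the sketch assumes single-threaded use.

  #include <map>
  #include <stack>
  #include <string>

  // Invented per-callpath profile record (not TAU's real structure).
  struct PathProfile {
    long   calls = 0;
    double inclusive_usec = 0.0;
  };

  static std::stack<std::string>            eventStack;   // current routine nesting
  static std::map<std::string, PathProfile> callpathMap;  // "a( )=>b( )" -> profile

  // On routine entry: the previous event on the stack is the parent,
  // so build the 1-level callpath key and look up (or create) its record.
  PathProfile* enterEvent(const std::string& name) {
    std::string key = eventStack.empty()
                          ? name                             // top-level event
                          : eventStack.top() + "=>" + name;  // parent=>child
    eventStack.push(name);
    PathProfile* p = &callpathMap[key];  // created on first parent->child call
    p->calls++;
    return p;
  }

  // On routine exit: accumulate measured time and pop the event stack.
  void leaveEvent(PathProfile* p, double elapsed_usec) {
    p->inclusive_usec += elapsed_usec;
    eventStack.pop();
  }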

Slide 17: Callpath Profiling Example (NAS LU v2.3)

% configure -PROFILECALLPATH -SGITIMERS -arch=sgi64 -mpiinc=/usr/include -mpilib=/usr/lib64 -useropt=-O2

Slide 18: Callpath Parallel Profile Display
• 0-level and 1-level callpath grouping
[Profile display screenshots: 0-Level Callpath, 1-Level Callpath]

Slide 19: Performance Monitoring and Steering
• Desirable to monitor performance during execution
  - Long-running applications
  - Steering computations for improved performance
• Large-scale parallel applications complicate solutions
  - More parallel threads of execution producing data
  - Large amount of performance data (relative) to access
  - Analysis and visualization are more difficult
• Problem: online performance data access and analysis
  - Incremental profile sampling (based on files)
  - Integration in a computational steering system
  - Dynamic performance measurement and access

Slide 20: Online Performance Analysis (K. Li, UO)

[Architecture diagram: the Application, measured by the TAU Performance System, writes parallel performance data streams to the file system; a Performance Data Integrator (sample sequencing, reader synchronization) and Performance Data Reader deliver accumulated samples to a Performance Analyzer and Performance Visualizer built in SCIRun (Univ. of Utah), closing the loop through Performance Steering]

Slide 21: 2D Field Performance Visualization in SCIRun
[Screenshot of the SCIRun program]

Slide 22: Uintah Computational Framework (UCF)
• University of Utah
• UCF analysis
  - Scheduling
  - MPI library
  - Components
  - 500 processes
• Use for online and offline visualization
• Apply SCIRun steering

Slide 23: Performance Analysis of Component Software
• Complexity in scientific problem solving addressed by
  - advances in software development environments
  - rich layered software middleware and libraries
• This increases complexity in performance problem solving
  - Integration barriers for performance technology
  - Incompatible with advanced software technology
  - Inconsistent with the software engineering process
• Problem: performance engineering for component systems
  - Respect the software development methodology
  - Leverage software implementation technology
  - Look for opportunities for synergy and optimization

Slide 24: Focus on Component Technology and CCA
• Emerging component technology for HPC and Grid
  - Component: software object embedding functionality
  - Component architecture (CA): how components connect
  - Component framework: implements a CA
• Common Component Architecture (CCA)
  - Standard foundation for scientific component architecture
  - Component descriptions: Scientific Interface Description Language (SIDL)
  - CCA ports for component interactions (provides and uses)
  - CCA services: directory, registry, connection, event
  - High-performance components and interactions

Slide 25: Extend Component Design for Performance
• Compliant with the component architecture
• Component composition performance engineering
• Utilize technology and services of the component framework
[Diagram: generic component]

Slide 26: Performance Knowledge
• Describe and store a "known" component's performance
  - Benchmark characterizations in a performance database
  - Models of performance: empirical-based, simulation-based, analytical-based
  - Saved information about component performance
• Use for performance-guided selection and deployment
• Use for runtime adaptation
• Representation must be in common forms with standard means for accessing the performance information

Slide 27: Performance Knowledge Repository & Component
• Component performance repository
  - Implement in the component architecture framework
  - Similar to the CCA component repository
  - Access by component infrastructure
• View performance knowledge as a component (PKC)
  - PKC ports give access to performance knowledge
    - to other components
    - back to the original component
  - Static/dynamic component control and composition
• Component composition performance knowledge

Slide 28: Performance Observation
• The ability to observe execution performance is important
  - Empirically-derived performance knowledge requires it
    (does not require measurement integration in the component)
  - Monitor during execution to make dynamic decisions
    (measurement integration is key)
• Performance observation integration
  - Component integration: core and variant
  - Runtime measurement and data collection
  - On-line and off-line performance analysis
• Performance observation technology must be as portable and robust as component software

Slide 29: Performance Observation Component (POC)
• Performance observation in a performance-engineered component model
• Functional extension of the original component design
• Include new component methods and ports for other components to access measured performance data
• Allow the original component to access performance data
  - encapsulated as a tightly-coupled, co-resident performance observation object
• The POC "provides" ports allow use of optimized interfaces to access "internal" performance observations

Slide 30: Architecture of a Performance Component
• Each component advertises its services
• Performance component ports:
  - Timer (start/stop)
  - Event (trigger)
  - Query (timers…)
  - Knowledge (component performance model)
• Prototype implementation of the timer
  - CCAFFEINE reference framework: http://www.cca-forum.org/café.html
  - SIDL
  - Instantiate with TAU functionality
[Diagram: Performance Component with Timer, Event, Query, and Knowledge ports]

Slide 31: TimerPort Interface Declaration in CCAFFEINE
• Create a Timer port abstraction

namespace performance {
  namespace ccaports {
    /**
     * This abstract class declares the Timer interface.
     * Inherit from this class to provide functionality.
     */
    class Timer :                        /* implementation of port */
        public virtual gov::cca::Port {  /* inherits from port spec */
    public:
      virtual ~Timer() { }
      /**
       * Start the Timer. Implement this function in
       * a derived class to provide required functionality.
       */
      virtual void start(void) = 0;  /* virtual methods with */
      virtual void stop(void) = 0;   /* null implementations */
      ...
    };
  }
}
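
A concrete Timer port might simply delegate to the TAU measurement library. The sketch below is hypothetical, not the prototype's actual code: the class name is invented, the setName signature is assumed (the full port interface is elided above), and TAU's dynamic timer macros (TAU_PROFILER_CREATE / TAU_PROFILER_START / TAU_PROFILER_STOP) are assumed to be available via TAU.h.

  #include <string>
  #include <TAU.h>   // TAU measurement API (assumed available)

  // Hypothetical TAU-backed implementation of the Timer port.
  class TauTimer : public virtual performance::ccaports::Timer {
  public:
    TauTimer() : tauTimer_(0) { }

    void setName(const std::string& name) {
      // Create (or look up) the underlying TAU timer for this name;
      // TAU_PROFILER_CREATE fills in the opaque handle.
      TAU_PROFILER_CREATE(tauTimer_, name.c_str(), "", TAU_USER);
    }

    virtual void start(void) { TAU_PROFILER_START(tauTimer_); }
    virtual void stop(void)  { TAU_PROFILER_STOP(tauTimer_); }

  private:
    void* tauTimer_;   // opaque timer handle managed by TAU
  };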

Slide 32: Using the Performance Component Timer
• The component uses framework services to get the TimerPort
• Use of this TimerPort interface is independent of TAU

// Get Timer port from CCA framework services (from CCAFFEINE)
port = frameworkServices->getPort("TimerPort");
if (port)
  timer_m = dynamic_cast<performance::ccaports::Timer*>(port);
if (timer_m == 0) {
  cerr << "Connected to something, not a Timer port" << endl;
  return -1;
}

string s = "IntegrateTimer";   // give name for timer
timer_m->setName(s);           // assign name to timer
timer_m->start();              // start timer (independent of tool)
for (int i = 0; i < count; i++) {
  double x = random_m->getRandomNumber();
  sum = sum + function_m->evaluate(x);
}
timer_m->stop();               // stop timer

Slide 33: Using SIDL for Language Interoperability
• Can create a Timer interface in SIDL for creating stubs

//
// File: performance.sidl
//
version performance 1.0;
package performance {
  class Timer {
    void start();
    void stop();
    void setName(in string name);
    string getName();
    void setType(in string name);
    string getType();
    void setGroupName(in string name);
    string getGroupName();
    void setGroupId(in long group);
    long getGroupId();
  }
}

Slide 34: Using the SIDL Interface for Timers
• C++ program that uses the SIDL Timer interface
• Again, independent of timer implementations (e.g., TAU)

// SIDL:
#include "performance_Timer.hh"

int main(int argc, char* argv[]) {
  performance::Timer t = performance::Timer::_create();
  ...
  t.setName("Integrate timer");
  t.start();
  // Computation
  for (int i = 0; i < count; i++) {
    double x = random_m->getRandomNumber();
    sum = sum + function_m->evaluate(x);
  }
  ...
  t.stop();
  return 0;
}

Slide 35: Using the TAU Component in CCAFFEINE

repository get TauTimer              /* get TAU component from repository */
repository get Driver                /* get application components */
repository get MidpointIntegrator
repository get MonteCarloIntegrator
repository get RandomGenerator
repository get LinearFunction
repository get NonlinearFunction
repository get PiFunction

create LinearFunction lin_func       /* create component instances */
create NonlinearFunction nonlin_func
create PiFunction pi_func
create MonteCarloIntegrator mc_integrator
create RandomGenerator rand
create TauTimer tau                  /* create TAU component instance */

/* connecting components and running */
connect mc_integrator RandomGeneratorPort rand RandomGeneratorPort
connect mc_integrator FunctionPort nonlin_func FunctionPort
connect mc_integrator TimerPort tau TimerPort
create Driver driver
connect driver IntegratorPort mc_integrator IntegratorPort
go driver Go
quit

Slide 36: Component Composition Performance Engineering
• Performance of component-based scientific applications depends on the interplay of
  - component functions
  - computational resources available
• Management of component compositions throughout execution is critical to successful deployment and use
• Identify key technological capabilities needed to support the performance engineering of component compositions
• Two model concepts:
  - Performance awareness
  - Performance attention

Slide 37: Performance Awareness of Component Ensembles
• Composition performance knowledge and observation
• Composition performance knowledge
  - Can come from empirical and analytical evaluation
  - Can utilize information provided at the component level
  - Can be stored in repositories for future review
• Extends the notion of component observation to ensemble-level performance monitoring
  - Associate monitoring components with hierarchical component groupings
  - Build upon component-level observation support
  - Monitoring components act as performance integrators and routers
  - Use component framework mechanisms

Slide 38: Performance Databases
• Focus on the empirical performance optimization process
  - Necessary for multi-result performance analysis
    - Multiple experiments (codes, versions, platforms, …)
    - Historical performance comparison
• Integral component of a performance analysis framework
  - Improved performance analysis architecture design
  - More flexible and open tool interfaces
  - Supports extensibility and foreign tool interaction
• Performance analysis collaboration
  - Performance tool sharing
  - Performance data sharing and knowledge base

Slide 39: Empirical-Based Performance Optimization

[Process diagram with stages Performance Observation, Performance Experimentation, Performance Diagnosis, and Performance Tuning, annotated with characterization, hypotheses, properties, Experiment Schemas, Experiment Trials, and observability requirements]

Slide 40: TAU Performance Database Framework

[Framework diagram: raw performance data and performance data descriptions pass through PerfDML translators into an ORDB (PostgreSQL PerfDB), which serves a performance analysis and query toolkit and performance analysis programs]
• Profile data only
• XML representation (PerfDML)
• project / experiment / trial

Slide 41: PerfDBF Components
• Performance Data Meta Language (PerfDML)
  - Common performance data representation
  - Performance meta-data description
  - Translators to the common PerfDML data representation
• Performance DataBase (PerfDB)
  - Standard database technology (SQL)
  - Free, robust database software (PostgreSQL)
  - Commonly available APIs
• Performance DataBase Toolkit (PerfDBT)
  - Commonly used modules for query and analysis
  - Facilitate analysis tool development

Slide 42: Common and Extensible Profile Data Format
• Goals
  - Capture data from profiling tools in a common representation
  - Implement the representation in a standard format
  - Allow for extension of the format for new profile data objects
• Base on XML (the obvious choice)
  - Leverage XML tools and APIs: XML parsers, Sun's Java SDK, …
  - XML verification systems (DTDs and schemas)
  - Target for profile data translation tools: eXtensible Stylesheet Language Transformations (XSLT)
• Which performance profile data are of interest?
  - Focus on TAU and consider other profiling tools

Slide 43: Performance Profiling
• Performance data about program entities and behaviors
  - Code regions: functions, loops, basic blocks
  - Actions or states
• Statistics data
  - Execution time, number of calls, number of FLOPS, …
• Characterization data
• Parallel profiles
  - Captured per process and/or per thread
  - Program-level summaries
• Profiling tools
  - prof/gprof, ssrun, uprofile/dpci, cprof/vprof, …

Slide 44: TAU Parallel Performance Profiles
[Screenshot of TAU parallel performance profile displays]

Slide 45: PerfDBF Example
• NAS Parallel Benchmark LU
• % configure -mpiinc=/usr/include -mpilib=/usr/lib64 -arch=sgi64 -fortran=sgi -SGITIMERS -useropt=-O2

[Data-flow diagram: NPB profiled with TAU → standard TAU output data → TAU to XML converter → TAU XML format → database loader → SQL database → analysis tool]

Slide 46: Scalability Analysis Process
• Scalability study on LU
  - Vary the number of processes: 1, 2, 4, and 8
    % mpirun -np 1 lu.W1
    % mpirun -np 2 lu.W2
    % mpirun -np 4 lu.W4
    % mpirun -np 8 lu.W8
• Populate the performance database
  - Run the Java translator to translate profiles into XML
  - Run the Java XML reader to write XML profiles to the database
• Read times for routines and the program from the experiments
• Calculate scalability metrics

Slide 47: Raw TAU Profile Data
• Fields: "name" #calls #subrs exclusive-time inclusive-time profile-calls GROUP="group name"
• One processor:
    "applu" 1 15 2939.096923828125 248744666.5830078 0 GROUP="applu"
• Four processors:
    "applu" 1 15 2227.343994140625 51691412.17797852 0 GROUP="applu"
    "applu" 1 14 596.568115234375 51691519.34106445 0 GROUP="applu"
    "applu" 1 14 616.833251953125 51691377.21313477 0 GROUP="applu"
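
For illustration, a translator consuming these raw lines could split the fields as follows. PerfDBF's actual profile-to-XML translator is written in Java (see slide 46); this C++ fragment is only a sketch of the field layout.

  #include <cstdio>
  #include <string>

  // Sketch: parse one raw TAU profile line of the form
  //   "name" calls subrs exclusive inclusive profilecalls GROUP="group"
  struct ProfileLine {
    std::string name, group;
    long   calls = 0, subrs = 0, profileCalls = 0;
    double exclusive = 0.0, inclusive = 0.0;   // times in microseconds
  };

  bool parseLine(const std::string& line, ProfileLine& out) {
    char name[256], group[256];
    // %255[^"] captures everything up to the closing quote.
    int n = std::sscanf(line.c_str(),
                        "\"%255[^\"]\" %ld %ld %lf %lf %ld GROUP=\"%255[^\"]\"",
                        name, &out.calls, &out.subrs,
                        &out.exclusive, &out.inclusive,
                        &out.profileCalls, group);
    if (n != 7) return false;
    out.name  = name;
    out.group = group;
    return true;
  }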

Slide 48: XML Profile Representation
• One processor — the XML element markup was lost in transcription; the surviving values for 'applu' are:
    'applu' 8 100.0 2.487446665830078E8 0.0 2939.096923828125 1 15 2.487446665830078E8
  (inclusive percent 100.0, inclusive time 2.487446665830078E8, exclusive time 2939.096923828125, 1 call, 15 subroutines, inclusive time per call 2.487446665830078E8)

Slide 49: XML Representation
• Four-processor mean — XML markup likewise lost in transcription; surviving values for 'applu':
    'applu' 12 100.0 5.169148940026855E7 0.0 1044.487548828125 1 14.25 5.1691489E7
  (inclusive percent 100.0, mean inclusive time 5.169148940026855E7, mean exclusive time 1044.487548828125, 1 call, 14.25 subroutines on average, inclusive time per call 5.1691489E7)

Slide 50: Contents of Performance Database
[Screenshot of database table contents]

Slide 51: Scalability Analysis Results
• Scalability of LU performance experiments
• Four trial runs

Funname | processors | meanspeedup
--------+------------+------------------------
…
applu   | 2          | 2.0896117809566
applu   | 4          | 4.812100975788783
applu   | 8          | 8.168409581149514
…
exact   | 2          | 1.95853126762839071803
exact   | 4          | 4.03622321124616535446
exact   | 8          | 7.193812137750623668346
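
The meanspeedup column is the ratio of mean execution times, speedup(p) = T_mean(1) / T_mean(p). As a check against the profile data on the preceding slides: for applu, the one-processor inclusive time of 2.487446665830078E8 usec divided by the four-processor mean of 5.169148940026855E7 usec gives 4.8121, matching the meanspeedup entry for applu at p = 4.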

Slide 52: Current PerfDBF Status and Future
• PerfDBF prototype
  - TAU profile to XML translator
  - XML to PerfDB populator
  - PostgreSQL database
  - Java-based PostgreSQL query module
• Use as a layer to support performance analysis tools
  - Makes accessing the Performance Database quicker
• Continue development
  - XML parallel profile representation
  - Basic specification
  - Opportunity for APART to define a common format

Slide 53: Performance Tracking and Reporting
• Integrated performance measurement allows performance analysis throughout the development lifetime
• Applied performance engineering in the software design and development (software engineering) process
• Create a "performance portfolio" from regular performance experimentation (coupled with software testing)
• Use performance knowledge in making key software design decisions, prior to major development stages
• Use performance benchmarking and regression testing to identify irregularities
• Support automatic reporting of "performance bugs"
• Enable cross-platform (cross-generation) evaluation

Slide 54: XPARE - eXPeriment Alerting and REporting
• The experiment launcher automates measurement / analysis
  - Configuration and compilation of performance tools
  - Instrumentation control for the Uintah experiment type
  - Execution of multiple performance experiments
  - Performance data collection, analysis, and storage
  - Integrated in the Uintah software testing harness
• The reporting system conducts performance regression tests
  - Applies performance difference thresholds (alert ruleset)
  - Alerts users via email if thresholds have been exceeded
  - Web alerting setup and full performance data reporting
  - Historical performance data analysis

Slide 55: XPARE System Architecture

[Architecture diagram with components: Experiment Launch, Performance Database, Comparison Tool, Regression Analyzer, Performance Reporter, Alerting Setup, Mail server, Web server]

Slide 56: Concluding Remarks
• Complex software and parallel computing systems pose challenging performance analysis problems that require robust methodologies and tools
• To build more sophisticated performance tools, existing proven performance technology must be utilized
• Performance tools must be integrated with software and systems models and technology
  - Performance-engineered software
  - Function consistently and coherently in software and system environments
• The TAU performance system offers robust performance technology that can be broadly integrated


