
1 Parallel Performance Mapping, Diagnosis, and Data Mining
Allen D. Malony, Sameer Shende, Li Li, Kevin Huck
{malony,sameer,lili,khuck}@cs.uoregon.edu
Department of Computer and Information Science
Performance Research Laboratory, University of Oregon

2 Research Motivation
- Tools for performance problem solving
- Empirical-based performance optimization process
- Performance technology concerns
[Diagram: performance optimization cycle: characterization; performance diagnosis (hypotheses); performance experimentation (properties); performance observation (instrumentation, measurement, analysis, visualization); performance tuning; supported by performance technology (experiment management, performance storage)]

3 Challenges in Performance Problem Solving
- How to make the process more effective (productive)?
- Process may depend on scale of parallel system
- What are the important events and performance metrics?
- Tied to application structure and computational model
- Tied to application domain and algorithms
- Process and tools can/must be more application-aware
- Tools have poor support for application-specific aspects
- What are the significant issues that will affect the technology used to support the process?
- Enhance application development and benchmarking
- New paradigm in performance process and technology

4 Large-Scale Performance Problem Solving
- How does our view of this process change when we consider very large-scale parallel systems?
- What are the significant issues that will affect the technology used to support the process?
- Parallel performance observation is clearly needed
- In general, there is the concern for intrusion
- Seen as a tradeoff with performance diagnosis accuracy
- Scaling complicates observation and analysis
- Performance data size becomes a concern
- Analysis complexity increases
- Nature of application development may change

5 Role of Intelligence, Automation, and Knowledge
- Scale forces the process to become more intelligent
- Even with intelligent and application-specific tools, deciding what to analyze is difficult and intractable
- More automation and knowledge-based decision making
- Build automatic/autonomic capabilities into the tools
- Support broader experimentation methods and refinement
- Access and correlate data from several sources
- Automate performance data analysis / mining / learning
- Include predictive features and experiment refinement
- Knowledge-driven adaptation and optimization guidance
- Will allow scalability issues to be addressed in context

6 Outline of Talk
- Performance problem solving
  - Scalability, productivity, and performance technology
  - Application-specific and autonomic performance tools
- TAU parallel performance system (Bernd said "No!")
- Parallel performance mapping
- Performance data management and data mining
  - Performance Data Management Framework (PerfDMF)
  - PerfExplorer
- Model-based parallel performance diagnosis
  - Poirot and Hercule
- Conclusions

7 TAU Performance System
[Architecture diagram; annotation: event selection]

8 Semantics-Based Performance Mapping
- Associate performance measurements with high-level semantic abstractions
- Need mapping support in the performance measurement system to assign data correctly

9 Hypothetical Mapping Example
- Particles distributed on surfaces of a cube

Particle* P[MAX];                  /* Array of particles */
int GenerateParticles() {
  /* distribute particles over all faces of the cube */
  for (int face = 0, last = 0; face < 6; face++) {
    /* particles on this face */
    int particles_on_this_face = num(face);
    for (int i = last; i < last + particles_on_this_face; i++) {
      /* particle properties are a function of face */
      P[i] = ... f(face); ...
    }
    last += particles_on_this_face;
  }
}

10 Hypothetical Mapping Example (continued)
- How much time (flops) is spent processing face i particles?
- What is the distribution of performance among faces?
- How is this determined if execution is parallel?

int ProcessParticle(Particle *p) {
  /* perform some computation on p */
}
int main() {
  GenerateParticles();            /* create a list of particles */
  for (int i = 0; i < N; i++)     /* iterates over the list */
    ProcessParticle(P[i]);
}

[Figure: engine, work packets]

11 No Performance Mapping versus Mapping
- Typical performance tools report performance with respect to routines
- This does not provide support for mapping
- TAU's performance mapping can observe performance with respect to the scientist's programming and problem abstractions
[Screenshots: TAU (no mapping) vs. TAU (w/ mapping)]

12 Performance Mapping Approaches
- ParaMap (Miller and Irvin)
  - Maps low-level performance to high-level source constructs
  - Noun-Verb (NV) model to describe the mapping
    - a noun is a program entity
    - a verb represents an action performed on a noun
    - sentences (noun and verb) map to other sentences
  - Mappings: static, dynamic, set of active sentences (SAS)
- Semantic Entities / Abstractions / Associations (SEAA)
  - Entities defined at any level of abstraction (user-level)
  - Attribute entity with semantic information
  - Entity-to-entity associations
  - Target measurement layer and asynchronous operation

13 SEAA Implementation
- Two association types (implemented in the TAU API):
  - Embedded: extends the associated object to store the performance measurement entity
  - External: creates an external look-up table, using the address of the object as the key to locate the performance measurement entity (see the sketch below)
- Applied to performance measurement problems
  - callpath/phase profiling, C++ templates, ...
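A minimal sketch of the external-association idea (all names here are hypothetical illustrations, not TAU's actual internals, which sit behind the mapping API shown on slide 17): a look-up table keyed by the address of an application object locates the performance measurement entity associated with it, without modifying the object itself.

#include <map>
#include <string>

/* Hypothetical stand-ins for a measurement entity and an application object. */
struct Timer { std::string name; double inclusive_time; };
struct Face  { int id; };            /* application-level entity: one cube face */

/* External association: object address -> measurement entity.
   (An embedded association would instead add a Timer* member to Face.) */
static std::map<const void*, Timer*> association;

void associate(const Face* f, Timer* t) { association[f] = t; }

Timer* lookup(const Face* f) {
    std::map<const void*, Timer*>::const_iterator it = association.find(f);
    return (it != association.end()) ? it->second : 0;
}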

14 Uintah Problem Solving Environment (PSE)
- Uintah component architecture for the Utah C-SAFE project
- Application programmers provide:
  - a description of the computation (tasks and variables)
  - code to perform a task on a single "patch" (sub-region of space)
- Components for scheduling, partitioning, load balance, ...
- Uintah Computational Framework (UCF)
  - Execution model based on software (macro) dataflow (sketched below)
  - computations expressed as directed acyclic graphs of tasks
  - inputs/outputs specified for each patch in a structured grid
  - Abstraction of a global single-assignment memory
  - Task graph gets mapped to processing resources
  - Communication schedule approximates the global optimum
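Purely as an illustration of the execution model described above (the actual UCF classes differ and these names are assumed), a task description in a macro-dataflow framework can be thought of as a list of variables it requires and computes per patch; the scheduler builds the task graph by matching computed variables to required ones.

#include <string>
#include <vector>

/* Hypothetical task description for a macro-dataflow scheduler (illustrative only). */
struct VariableRef { std::string name; int patch; };   /* a variable on one patch */

struct TaskSpec {
    std::string name;                     /* a named computation step */
    std::vector<VariableRef> requires_;   /* inputs read from the data warehouse */
    std::vector<VariableRef> computes;    /* outputs written (single-assignment) */
    void (*run)(int patch);               /* code executed on a single patch */
};
/* The scheduler orders TaskSpecs into a DAG by matching 'computes' to 'requires_'. */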

15 Uintah Task Graph (Material Point Method)
- Diagram of named tasks (ovals) and data (edges)
- Imminent computation
- Dataflow-constrained
- MPM: Newtonian material point motion time step
  - Solid edges: values defined at a material point (particle)
  - Dashed edges: values defined at a vertex (grid)
  - Prime ('): values updated during the time step

16 Task Execution in Uintah Parallel Scheduler
- Profile methods and functions in scheduler and in MPI library
- Need to map performance data!
[Profile screenshots; annotations: task execution time dominates (what task?), MPI communication overheads (where?), task execution time distribution]

17 Mapping Instrumentation in UCF (example)
- Use TAU performance mapping API

void MPIScheduler::execute(const ProcessorGroup * pc,
                           DataWarehouseP & old_dw,
                           DataWarehouseP & dw) {
  ...
  TAU_MAPPING_CREATE(task->getName(), "[MPIScheduler::execute()]",
                     (TauGroup_t)(void*)task->getName(), task->getName(), 0);
  ...
  TAU_MAPPING_OBJECT(tautimer)
  TAU_MAPPING_LINK(tautimer, (TauGroup_t)(void*)task->getName());  // EXTERNAL ASSOCIATION
  ...
  TAU_MAPPING_PROFILE_TIMER(doitprofiler, tautimer, 0)
  TAU_MAPPING_PROFILE_START(doitprofiler, 0);
  task->doit(pc);
  TAU_MAPPING_PROFILE_STOP(0);
  ...
}

18 Task Performance Mapping (Profile)
[Profile screenshots: performance mapping for different tasks; mapped task performance across processes]

19 Work Packet-to-Task Mapping (Trace)
- Work packet computation events colored by task type
- Distinct phases of computation can be identified based on task

20 Comparing Uintah Traces for Scalability Analysis
[Trace visualizations: 8 processes vs. 32 processes]

21 Important Questions for Application Developers
- How does performance vary with different compilers?
- Is poor performance correlated with certain OS features?
- Has a recent change caused unanticipated performance?
- How does performance vary with MPI variants?
- Why is one application version faster than another?
- What is the reason for the observed scaling behavior?
- Did two runs exhibit similar performance?
- How are performance data related to application events?
- Which machines will run my code the fastest and why?
- Which benchmarks predict my code performance best?

22 Performance Problem Solving Goals
- Answer questions at multiple levels of interest
- Data from low-level measurements and simulations
  - used to predict application performance
- High-level performance data spanning dimensions
  - machine, applications, code revisions, data sets
  - examine broad performance trends
- Discover general correlations between application performance and features of the external environment (see the sketch below)
- Develop methods to predict application performance from lower-level metrics
- Discover performance correlations between a small set of benchmarks and a collection of applications that represent a typical workload for a given system
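As one small, concrete instance of the correlation analyses listed above (a generic sketch, not a specific tool's implementation): the Pearson correlation between two equal-length series, for example a benchmark metric and an application's runtime measured across the same set of machines.

#include <cmath>
#include <cstddef>

/* Pearson correlation coefficient between series x and y of length n,
   e.g. benchmark score vs. application runtime across n machines.
   Assumes n > 1 and non-zero variance in both series. */
double pearson(const double* x, const double* y, std::size_t n) {
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sx  += x[i];        sy  += y[i];
        sxx += x[i] * x[i]; syy += y[i] * y[i];
        sxy += x[i] * y[i];
    }
    double cov = sxy - sx * sy / n;
    double vx  = sxx - sx * sx / n;
    double vy  = syy - sy * sy / n;
    return cov / std::sqrt(vx * vy);
}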

23 Empirical-Based Performance Optimization
[Process diagram: characterization; performance diagnosis (hypotheses); performance experimentation (properties); performance observation; observability requirements; experiment management with experiment schemas and experiment trials]

24 Performance Data Management Framework (PerfDMF)
- ICPP 2005 paper

25 PerfExplorer (K. Huck, Ph.D. student, UO)
- Performance knowledge discovery framework
  - Uses the existing TAU infrastructure: TAU instrumentation data, PerfDMF
  - Client-server based system architecture
  - Data mining analysis applied to parallel performance data
    - comparative, clustering, correlation, dimension reduction, ... (an illustrative clustering sketch follows below)
- Technology integration
  - Relational Database Management Systems (RDBMS)
  - Java API and toolkit
  - R-project / Omegahat statistical analysis
  - WEKA data mining package
  - Web-based client
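PerfExplorer delegates its analyses to WEKA and R rather than implementing them itself; the sketch below is only meant to make "clustering parallel profiles" concrete. It is an assumed, minimal k-means over per-thread profile vectors (one value per instrumented event), seeded naively with the first k profiles.

#include <cstddef>
#include <vector>

typedef std::vector<double> Profile;   /* one value per event, e.g. exclusive time */

static double dist2(const Profile& a, const Profile& b) {
    double d = 0;
    for (std::size_t j = 0; j < a.size(); ++j) d += (a[j] - b[j]) * (a[j] - b[j]);
    return d;
}

/* Cluster n per-thread profiles into k groups (requires n >= k > 0);
   returns the cluster label of each thread. */
std::vector<int> kmeans(const std::vector<Profile>& p, int k, int iterations) {
    std::size_t n = p.size(), m = p[0].size();
    std::vector<Profile> centers(p.begin(), p.begin() + k);   /* naive seeding */
    std::vector<int> label(n, 0);
    for (int it = 0; it < iterations; ++it) {
        /* assignment step: nearest center by squared Euclidean distance */
        for (std::size_t i = 0; i < n; ++i) {
            int best = 0;
            double bestd = dist2(p[i], centers[0]);
            for (int c = 1; c < k; ++c) {
                double d = dist2(p[i], centers[c]);
                if (d < bestd) { bestd = d; best = c; }
            }
            label[i] = best;
        }
        /* update step: recompute each center as the mean of its members */
        std::vector<Profile> sum(k, Profile(m, 0.0));
        std::vector<int> count(k, 0);
        for (std::size_t i = 0; i < n; ++i) {
            for (std::size_t j = 0; j < m; ++j) sum[label[i]][j] += p[i][j];
            ++count[label[i]];
        }
        for (int c = 0; c < k; ++c)
            if (count[c] > 0)
                for (std::size_t j = 0; j < m; ++j) centers[c][j] = sum[c][j] / count[c];
    }
    return label;
}

Threads whose profiles land in the same cluster behave similarly; comparing cluster centroids then highlights which events separate the groups.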

26 PerfExplorer Architecture
- SC'05 paper

27 PerfExplorer Client GUI

28 Hierarchical and K-means Clustering (sPPM)

29 Miranda Clustering on 16K Processors

30 Parallel Performance Diagnosis
- Performance tuning process
  - Process to find and report performance problems
  - Performance diagnosis: detect and explain problems
  - Performance optimization: performance problem repair
- Experts approach the process systematically and use experience
  - Hard to formulate and automate expertise
  - Performance optimization is fundamentally hard
- Focus on the performance diagnosis problem
  - Characterize diagnosis processes
  - How it integrates with performance experimentation
  - Understand the knowledge engineering

31 Parallel Performance Diagnosis Architecture

32 Performance Diagnosis System Architecture

33 Problems in Existing Diagnosis Approaches
- Low-level abstraction of properties/metrics
  - Independent of program semantics
  - Relate to component structure, not algorithmic structure or parallelism model
- Insufficient explanation power
  - Hard to interpret in the context of program semantics
  - Performance behavior not tied to operational parallelism
- Low applicability and adaptability
  - Difficult to apply in different contexts
  - Hard to adapt to new requirements

34 Poirot Project
- Lack of a formal theory of diagnosis processes
  - Compare and analyze performance diagnosis systems
  - Use theory to create a system that is automated / adaptable
- Poirot performance diagnosis (theory, architecture)
  - Survey of diagnosis methods / strategies in tools
  - Heuristic classification approach (match to characteristics)
  - Heuristic search approach (based on problem knowledge)
- Problems
  - Descriptive results do not explain with respect to context
    - users must reason about high-level causes
  - Performance experimentation not guided by diagnosis
  - Lacks automation

35 Model-Based Approach
- Knowledge-based performance diagnosis
  - Capture knowledge about performance problems
  - Capture knowledge about how to detect and explain them
- Where does the knowledge come from?
  - Extract it from parallel computational models
  - Structural and operational characteristics
  - Associate computational models with performance
- Do parallel computational models help in diagnosis?
  - Enable better understanding of problems
  - Enable more specific experimentation
  - Enable more effective hypothesis testing and search

36 Implications for Performance Diagnosis
- Models benefit performance diagnosis
  - Base instrumentation on program semantics
  - Capture performance-critical features
  - Enable explanations close to user's understanding
    - of computation operation
    - of performance behavior
  - Reuse performance analysis expertise
    - on the commonly-used models
- Model examples
  - Master-worker model
  - Pipeline
  - Divide-and-conquer
  - Domain decomposition
  - Phase-based
  - Compositional

37 Hercule Project
- Goals of automation, adaptability, validation

38 Approach
- Make use of model knowledge to diagnose performance
  - Start with commonly-used computational models
- Engineering of model knowledge
- Integrate model knowledge with the performance measurement system
- Build a cause inference system
  - define "causes" at the parallelism level
  - build a causality relation between the low-level "effects" and the "causes" (see the sketch below)
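To make "causes at the parallelism level" concrete, here is a toy inference check for the master-worker model. The metric names and thresholds are assumptions for illustration; Hercule's actual knowledge is encoded as CLIPS rules (slide 41), not C++.

#include <string>
#include <vector>

/* Toy master-worker diagnosis; all metrics and thresholds are illustrative only. */
struct MWMetrics {
    double init_final_fraction;    /* fraction of runtime in initialization/finalization */
    double assign_fraction;        /* fraction of runtime the master spends assigning tasks */
    double mean_task_time;         /* average task granularity, in seconds */
    double worker_imbalance;       /* (max worker busy time / mean worker busy time) - 1 */
};

std::vector<std::string> diagnose(const MWMetrics& m) {
    std::vector<std::string> causes;
    if (m.init_final_fraction > 0.2) causes.push_back("insufficient parallelism");
    if (m.mean_task_time < 1e-3)     causes.push_back("fine granularity");
    if (m.assign_fraction > 0.3)     causes.push_back("master is a bottleneck");
    if (m.worker_imbalance > 0.5)    causes.push_back("some workers noticeably inefficient");
    return causes;
}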

39 Master-Worker Parallel Computation Model

40 Performance Diagnosis Inference Tree (MW)
[Inference tree diagram. Starting from the observation of low speedup, intermediate hypotheses (initialization or finalization time significant; master-assign-task time significant; workers waiting a long time in the master queue in some intervals, i.e. number of requests in the master queue > K1; waiting a long time for the master to assign each individual task; a large amount of messages exchanged each time; such intervals > K2 or < K2; time imbalance) lead to prioritized causes: 1. insufficient parallelism, 2. fine granularity, 3. master being a bottleneck (worker-number saturation, worker starvation), 4. some workers noticeably inefficient. Legend: Ki = threshold, '+' = coexistence.]

41 Knowledge Engineering - Abstract Event (MW)
- Use CLIPS expert system building tool

42 Diagnosis Results Output (MW)

43 Experimental Diagnosis Results (MW)

44 Concluding Discussion
- Performance tools must be used effectively
- More intelligent performance systems for productive use
  - Evolve to application-specific performance technology
  - Deal with scale by "full range" performance exploration
  - Autonomic and integrated tools
  - Knowledge-based and knowledge-driven process
- Performance observation methods do not necessarily need to change in a fundamental sense
  - They need to be more automatically controlled and efficiently used
- Support model-driven performance diagnosis
- Develop next-generation tools and deliver them to the community

45 Support Acknowledgements
- Department of Energy (DOE)
  - Office of Science contracts
  - University of Utah ASCI Level 1 sub-contract
  - ASC/NNSA Level 3 contract
- NSF
  - High-End Computing Grant
- Research Centre Juelich
  - John von Neumann Institute
  - Dr. Bernd Mohr
- Los Alamos National Laboratory

