Performance Analysis and Optimization through Run-time Simulation and Statistics Philip J. Mucci University Of Tennessee firstname.lastname@example.org http://www.cs.utk.edu/~mucci
Motivation Tuning real DOD and DOE applications! –Performance on most codes is low. –Overall efficiency is poor because single-node performance is poor. –Codes show good scalability because of that slow single-node performance and faster interconnects. –The expertise is not there, nor should it be.
Description Use data available at run time to improve compilation and optimization technology. Empirically determine how well the code maps to the underlying architecture. Bottlenecks can be identified and possibly corrected by an explicit set of rules and transformations.
Information not being used Hardware statistics gathered through simulation or monitoring can identify the problem. (sample listing) –Cache and branching behavior –Cycle/Load/Store/FLOP counts –Bottleneck determination –Reference pattern –Dynamic memory placement
Problem Areas Efficient use of the memory hierarchy Register re-use Aliasing Inlining Demotion Algorithms (iterative vs. direct)
Increasing Cache Performance How do we make better use of the memory hierarchy? For computer scientists, it's not that hard. We need the right tools. How much can we automate? Through available tools and source analysis we can usually narrow the problem down to the function.
Cache Simulation Instrumentation of routines Run of the executable Analysis and correlation with source code! Old idea, new implementation.
Cache Simulation Hardware independence Information on: –Locality –Placement –Reference pattern and Reuse –Line usage
Locality Spatial and Temporal –misses/memory reference –misses/re-use Conflict vs. Capacity
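The slide does not say how the tool separates conflict from capacity misses. A minimal sketch of one common approach, assuming a direct-mapped cache under study and a fully associative LRU cache of the same size as the reference model, is shown here. The function name `classify_misses` and the default geometry (4 lines of 32 bytes) are illustrative assumptions, not part of the original tool.

```python
from collections import OrderedDict

def classify_misses(addresses, num_lines=4, line_size=32):
    """Classify direct-mapped cache misses as cold, capacity, or conflict.

    Hypothetical helper: a miss that a same-sized fully associative LRU
    cache would have avoided is counted as a conflict miss.
    """
    direct = {}             # set index -> line currently resident
    fully = OrderedDict()   # fully associative LRU cache (reference model)
    seen = set()            # lines referenced at least once
    stats = {"refs": 0, "cold": 0, "capacity": 0, "conflict": 0}
    for addr in addresses:
        line = addr // line_size
        stats["refs"] += 1
        # Look up (and update) the fully associative reference cache.
        fa_hit = line in fully
        if fa_hit:
            fully.move_to_end(line)
        else:
            fully[line] = True
            if len(fully) > num_lines:
                fully.popitem(last=False)   # evict least recently used
        # Look up (and update) the direct-mapped cache being studied.
        index = line % num_lines
        dm_hit = direct.get(index) == line
        direct[index] = line
        if not dm_hit:
            if line not in seen:
                stats["cold"] += 1
            elif fa_hit:
                stats["conflict"] += 1
            else:
                stats["capacity"] += 1
        seen.add(line)
    return stats

def miss_ratio(stats):
    """Misses per memory reference, the first locality metric on the slide."""
    misses = stats["cold"] + stats["capacity"] + stats["conflict"]
    return misses / stats["refs"]
```

Two addresses that map to the same set (for example 0 and 128 with this geometry) produce only cold and conflict misses when alternated, while streaming twice over more lines than the cache holds produces cold and capacity misses.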
Placement Padding can be very important Not always possible to do during static analysis phase. Reference pattern can affect padding.
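To illustrate why padding matters, here is a small sketch under assumed parameters: a 2 KB direct-mapped cache (64 lines of 32 bytes) and two arrays of 256 eight-byte elements read in lockstep. The helper names and the cache geometry are assumptions chosen so that the unpadded arrays collide in every set.

```python
def count_misses(addresses, num_lines=64, line_size=32):
    """Count misses in a direct-mapped cache (illustrative geometry)."""
    cache = [None] * num_lines
    misses = 0
    for addr in addresses:
        line = addr // line_size
        index = line % num_lines
        if cache[index] != line:
            cache[index] = line
            misses += 1
    return misses

def interleaved_refs(base_a, base_b, n, elem_size=8):
    """Yield the addresses touched by a loop reading a[i] then b[i]."""
    for i in range(n):
        yield base_a + i * elem_size
        yield base_b + i * elem_size

N = 256                      # elements per array
ARRAY_BYTES = N * 8          # 2048 bytes, exactly the assumed cache size
PAD = 32                     # one cache line of padding between the arrays

# b placed directly after a: a[i] and b[i] always map to the same set.
unpadded = count_misses(interleaved_refs(0, ARRAY_BYTES, N))
# b shifted by one line: a[i] and b[i] now fall in neighbouring sets.
padded = count_misses(interleaved_refs(0, ARRAY_BYTES + PAD, N))
```

With these assumptions every reference misses in the unpadded layout, while one line of padding leaves mostly cold misses plus a few conflicts at line boundaries. Whether a given pad helps depends on the reference pattern, which is the point the slide makes.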
Reference Pattern Again, not always possible to do during static analysis. Even harder to analyze when dealing with pseudo-optimized code. Examples: Stencils, Sparse solvers etc...
Reuse Blocking is critical to applications where there is re-use. We need to identify re-use potential, to spot areas where blocking and register allocation should be focused on.
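As a reminder of what blocking looks like once a re-use opportunity has been found, here is a sketch of a tiled matrix multiply next to the naive loop nest. The function names and the block size are illustrative assumptions. In practice the block size would be chosen from the cache geometry and the measured re-use.

```python
def matmul_naive(a, b, n):
    """Reference triple loop: each row of b is re-read for every row of a."""
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            for j in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c

def matmul_blocked(a, b, n, block=2):
    """Tiled version: work on block x block tiles so they stay cache resident."""
    c = [[0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, n)):
                        a_ik = a[i][k]   # scalar re-used across the j loop
                        for j in range(jj, min(jj + block, n)):
                            c[i][j] += a_ik * b[k][j]
    return c
```

Both versions compute the same result. Only the order in which the elements are visited changes, which is why a tool that reports re-use per reference can point at loops where this transformation is worth trying.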
Source Code Mapping Most cache tools are hard to use and hard to relate to the source code. This tool simulates the cache(s) on each memory reference and thus can easily correlate the data with the source. Instrumentation is at the source level, not object code.
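The idea of source-level instrumentation can be sketched as follows, assuming each instrumented memory reference calls into the simulator with a tag naming its source location. The class `TracingCache`, its method names, and the tag strings are hypothetical, and the real tool's instrumentation interface is not described in the slides.

```python
from collections import Counter

class TracingCache:
    """Direct-mapped cache that attributes references and misses to source sites."""

    def __init__(self, num_lines=4, line_size=32):
        self.num_lines = num_lines
        self.line_size = line_size
        self.lines = [None] * num_lines
        self.refs = Counter()     # site -> number of references
        self.misses = Counter()   # site -> number of misses

    def reference(self, site, addr):
        """Called by the instrumented code for every memory reference."""
        line = addr // self.line_size
        index = line % self.num_lines
        self.refs[site] += 1
        if self.lines[index] != line:
            self.lines[index] = line
            self.misses[site] += 1

    def report(self):
        """Return {site: (references, misses)} sorted by site name."""
        return {s: (self.refs[s], self.misses[s]) for s in sorted(self.refs)}

# What an instrumented loop "y[i] += x[i]" might emit (addresses are made up).
cache = TracingCache()
for i in range(8):
    cache.reference("axpy.c:12 x[i]", 0 + 8 * i)
    cache.reference("axpy.c:12 y[i]", 128 + 8 * i)
```

Because every reference carries its source tag, the report can be printed next to the source lines without any object-code symbol lookup.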
Statistics Global, per file, per statement, per reference References, misses, cold misses, re-used references Conflict/Re-use matrix –M(A,B) = x means some element of A ejected some element of B from the cache x times iff that element of A has been in the cache before.
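The conflict/re-use matrix can be sketched as below, reading the definition as "count an eviction of B's line by A's line only when A's line had been in the cache before". The function name, the `(array_name, address)` input format, and the direct-mapped geometry are assumptions made for illustration.

```python
from collections import defaultdict

def conflict_matrix(refs, num_lines=4, line_size=32):
    """Build M[(A, B)]: times a previously cached line of A ejected a line of B.

    refs is an iterable of (array_name, address) pairs (hypothetical format).
    """
    cache = [None] * num_lines   # each entry is (line, owning array) or None
    seen = set()                 # lines that have been in the cache before
    matrix = defaultdict(int)
    for name, addr in refs:
        line = addr // line_size
        index = line % num_lines
        resident = cache[index]
        if resident is None or resident[0] != line:
            # Miss: count it only if the incoming line is a returning one
            # and it actually ejects something.
            if resident is not None and line in seen:
                matrix[(name, resident[1])] += 1
            cache[index] = (line, name)
        seen.add(line)
    return dict(matrix)
```

A large symmetric pair such as M(A,B) and M(B,A) suggests that the two arrays keep ejecting each other, which is the situation where padding or blocking is worth investigating.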
Development status GUI for selective instrumentation Real parsers (F90, C, C++) Better report generation
Implementation Simulator written in C Instrumentation in Perl GUI in Java Report generator in Perl
Relevance Why shouldn't this technology be part of a feedback loop? –Compile with instrumentation –Run –Recompile with information from the run –Watch input sensitivity issues.
Integration Identifying and correcting poor cache behavior can be made explicit and part of a compiler. (Ideally a source-to-source transformer or preprocessor) Simulator can stand alone for detailed analysis and optimization by CS folks. Our knowledge and expertise made available through the tools.
Hardware Counters Virtually every processor available has hardware counters. The interfaces and documentation are poor or non-existent. Hardware differs greatly, as do the semantics. Useful for measurement, analysis, optimization, modeling and benchmarking.
Performance Data Standard Standardize an API to obtain hardware performance counters Standardize the definitions of what those counters mean API is lightweight and portable
Performance Data Standard Target platforms –R10K, R12K –P2SC, PowerPC 604e, Power 3 –Sun Ultra 2/3 –Intel PII, Katmai, Merced –Alpha 21164, 21264
Performance Data Standard Motivation –Portable performance tools –Optimization through feedback –Developers wanting simple and accurate timing and statistics –Modeling, evaluation
Performance Data Standard Small number of useful measurement points –Timing in cycles and microseconds –I/D cache misses, invalidations –Branch mispredictions –Load, store, FLOP, and instruction counts –I/D TLB misses
Performance Data Standard API Efficient counter multiplexing Thread safety Functions for –start, stop, reset, get, accumulate, query, control Use the best available vendor supported interface or API Possible pairing with DAIS, Dyninst for naming
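The slides list the operations but not their signatures, so the following is only a sketch of how such an interface might feel to a caller. Everything here is hypothetical: the class `CounterSet`, the event names, and the method signatures are assumptions. Real hardware events are not reachable from the Python standard library, so two software clocks stand in for counters. The "control" operation is omitted.

```python
import time

# Hypothetical event table. Software clocks stand in for hardware counters.
EVENTS = {
    "WALL_NS": time.perf_counter_ns,   # elapsed wall-clock nanoseconds
    "CPU_NS": time.process_time_ns,    # process CPU nanoseconds
}

class CounterSet:
    """Illustrative start/stop/reset/get/accumulate/query interface."""

    def __init__(self, events):
        unknown = [e for e in events if e not in EVENTS]
        if unknown:
            raise ValueError(f"unsupported events: {unknown}")
        self.events = list(events)
        self.totals = dict.fromkeys(self.events, 0)
        self._started = None

    @staticmethod
    def query():
        """List the events this (mock) platform supports."""
        return sorted(EVENTS)

    def start(self):
        """Snapshot every counter at the start of a measured region."""
        self._started = {e: EVENTS[e]() for e in self.events}

    def stop(self):
        """Add the counts since start() to the running totals."""
        if self._started is None:
            raise RuntimeError("stop() called before start()")
        for e in self.events:
            self.totals[e] += EVENTS[e]() - self._started[e]
        self._started = None

    def reset(self):
        """Zero the running totals."""
        self.totals = dict.fromkeys(self.events, 0)

    def get(self):
        """Return a copy of the running totals."""
        return dict(self.totals)

    def accumulate(self, into):
        """Add the totals into a caller-owned dict, then reset."""
        for e, value in self.totals.items():
            into[e] = into.get(e, 0) + value
        self.reset()
```

A caller would wrap a region of interest in `start()` and `stop()`, then either read the totals with `get()` or fold them into a longer-running tally with `accumulate()`.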
Development status Research on the hardware and interfaces available on the various machines Compilation of findings, web page and mailing list API specification to appear mid August for discussion Vendors are lurking http://www.cs.utk.edu/~mucci/pdsa
Deliverables API for O2K, T3E, SP Portable prof implementation
People Shirley Browne (UT) Jeff Brown (LANL) Jeff Durachta (IBM, LANL) Christopher Kerr (IBM, LANL) George Ho (UT) Kevin London (UT) Philip Mucci (UT, Sandia)