© 2004 Wayne Wolf Topics Goals of program performance anlaysis. Worst-case execution time analysis. Execution-based timing analysis.

© 2004 Wayne Wolf CPUs and software performance System performance cannot be determined without choosing a CPU Software execution times definitely don’t scale across CPU architectures, perhaps not even within a CPU family Architectural features which influence program performance:  pipelining  caching  bus bandwidth Ceramic or plastic is a key design decision.

© 2004 Wayne Wolf Hierarchical performance modeling We would like to have a hierarchy of increasingly accurate performance models, from system specification to assembly code. Very little work in C-level performance modeling- --hard to separate the program from the CPU. Two types of questions:  which variety of Brand X CPU do I use?  should I use Brand X or Brand Y?

© 2004 Wayne Wolf Caches and code speed Worst-case:  tight-deadline device interrupts  driver is not in cache  multiple high-priority drivers knock each other out of the cache Cache miss costs from a few cycles on up; the faster the CPU, the more costly is a miss Worst-case execution time is much larger than best-case, leading to extreme overengineering.

© 2004 Wayne Wolf Alternative approaches to performance analysis Conservative analysis---performance always within bounds.  WCET gives bounds from analysis, limited simulation. Detailed analysis---more info on a particular case but no bounds.  Execution-based methods provide lots of details but only for given input data.

© 2004 Wayne Wolf Performance of HLLs We would like to bound or estimate program performance from high-level code:  simplifies identification of paths  provides early performance estimates Realistically, we need to know the execution platform.

© 2004 Wayne Wolf Paths and performance Branches of conditional have different execution times. Loops:  Multiple iterations.  Varying number of iterations. F1() F2() i<N if Loop start t f Loop body t f

© 2004 Wayne Wolf Measurements of interest Execution time bounds: worst, best.  Upper/lower bounds important in multitasking systems. Execution time of incomplete code.  Must be able to handle time estimates for pieces of code. Bounds of varying quality:  Loose bounds quickly.  Tight bounds with more effort.

© 2004 Wayne Wolf Early work: explicit path analysis Shaw developed techniques to prove bounds on the number of times through paths. Park and Shaw developed techniques for path analysis and for measurement of execution times of HLL statements on a 68000.

© 2004 Wayne Wolf Challenges and approaches Exponential number of paths.  Limit program constructs.  Add annotations.  Implicit path analysis. Instruction times are not independent: pipeline effects; cache effects.  State-dependent instruction execution time.  Simulation.

© 2004 Wayne Wolf Abstract program flow anlaysis Bound set of feasible paths without exhaustive simulation.  May include some infeasible paths in the feasible set. Perform abstract interpretation of the program to find feasible paths.  Generate safe bounds on the values of variables.

© 2004 Wayne Wolf Abstract interpretation example (Engblom et al.) while (x<4) { S1if (x<3) x = x*2; S2else x = x+1; S3if (x == 1) x = x+2; S4else x=x+1; } Input limits: 0 <= x <= 3 start: 0 <= x <= 2; end: 0 <= x <= 4 start: 3 <= x <= 3; end: 4 <= x <= 4 start: 0 <= x <= 4; end: 1 <= x <= 5start: 3 <= x <= 3; end: 5 <= x <= 5 start: 0 <= x <= 4; end: 3 <= x <= 3 start: 3 <= x <= 3; INFEASIBLE

© 2004 Wayne Wolf Example (cont’d.) X = [0..3] S1S2 X = [0..4] S3S4 X = [4..4] S3S4 X = [3..3]X = [1..3]X = [5..5] X = [1..5] merge Iteration 1 Iteration 2 X = [1..3]X = [4..5] S1S2 … X = [2..4]X = [4..4] …

© 2004 Wayne Wolf Bounding loop iterations Uppsalla/FSU: handle complex loops. Four phases: 1. Iteratively identify branches that affect the number of iteraitons. 2. Identify loop iteration on which loop index- dependent branches change direction. 3. Use step 2 to determine when step 1 branches are reached. 4. Calculate bounds on number of iterations.

© 2004 Wayne Wolf Loop iteration example (Engblom et al) for (i=0, j=1; i<100; i++, j+=3) { if (j>75 and somecond || j>300) break; } i=0, j=1 jump j>75, jump if false somecond, jump if false return j>300, jump if true i++, j+=3 j<100, jump if false jump iteration 26 iteration 101 Lower bound: 26 Upper bound: 101 Redundant code

© 2004 Wayne Wolf Implicit path analysis Schedl, Li/Malik—find path length without explicitly finding path. Formulate as constraint solving problem:  Generate constraints that describe program, annotations.  Solve using constraint solver, ILP (depending on types of constraints allowed).

© 2004 Wayne Wolf Types of constraints Structural constraints:  Describe conditionals, etc. Finiteness and start constraints:  Bound loop iterations. Tightening constraints:  Infeasible paths.  User constraints.

© 2004 Wayne Wolf WCET and optimizing compilers Optimizing compilers can radically change program control flow. Must analyze timing of the optimized code.  Annotations must be transferred to the optimized code.  Must be able to perform the program transformations on the optimizations.

© 2004 Wayne Wolf Cache analysis extensions Must segment program units around cache lines. Different execution times for in-cache and out-of- cache. Conservative assumption: use in-cache time only if statement is known to be in cache. Add constraints which model cache state based on program flow.

© 2004 Wayne Wolf Cache analysis model (Colin, Puaut) Instruction block (iblock): a basic block fragment that fits into a cache line.  Decomposition of program into iblocks depends on cache organization. Determine paths on which iblocks result in hits, misses.

© 2004 Wayne Wolf Branch prediction bounding (Colin & Puaut) Missed branch prediction causes pipeline bubble.  Branch predictor has finite capacity.  Predictor may make wrong prediction. Keep track of branch history.  Memoryless predictors are a special case. Determine what prediction the machine will make to determine whether a bubble may be caused.  Known correct predictions cause no bubble.  Known incorrect predictions cause a bubble.  Indeterminate results are pessimistically presumed to cause a bubble.

© 2004 Wayne Wolf Data caching Data address may not be known at compile time:  Pointers.  Stack variables. Caching of stack variables is easier to compute.  Offset from stack pointer is known.  Can compute cache block based upon sp offset.

© 2004 Wayne Wolf Timing through simulation Use a simulator to time a sequence of instructions.  Can simulate basic blocks with boundary conditions for branches. WCET tools often use custom simulators designed for small pieces of code, call by subroutine.

© 2004 Wayne Wolf Pipeline interactions Basic block execution time depends on the path.  Next block may or may not overlap with current block. Overlap model: execution time of block includes (variable) overlap gains from next block. BB1 BB2BB3 T1(BB1) T2(BB1,BB2)T2(BB1,BB3)

© 2004 Wayne Wolf Behavioral performance analysis Use program behavior to analyze performance. Advantages:  Handles arbitrary program.  Captures realistic behavior. Disadvantages:  Doesn’t guarantee worst-case/best-case behavior.

© 2004 Wayne Wolf PC sampling Example: Unix prof. Interrupts are used to sample PC periodically.  Must run on the platform.  Doesn’t provide complete trace.  Subject to sampling problems: undersampling, periodicity problems.

© 2004 Wayne Wolf Direct execution Model target programming model within the simulation host.  May need to use variables to capture registers not available in host.  May need to generate extra instructions to generate the non-native state. Use simulation host instructions to model target instructions.  Mapping may be many host->one target.

© 2004 Wayne Wolf Cycle-accurate simulator Models the microarchitecture.  Simulating one instruction requires executing routines for instruction decode, etc. Models pipeline state.  Microarchitectural registers are exposed to the simulator. reg IR PC I-box

© 2004 Wayne Wolf Trace-based vs. execution-based Trace-based:  Gather trace first, then generate timing information.  Basic timing information is simpler to generate.  Full timing information may require regenerating information from the original execution.  Requires owning the platform. Execution-based:  Simulator fully executes the instruction.  Requires a more complex simulator.  Requires explicit knowledge of the microarchitecture, not just instruction execution times.

© 2004 Wayne Wolf Sources of timing information Data book tables:  Time of individual instructions.  Penalties for various hazards. Microarchitecture:  Depends from the structure of machine.  Derived from execution of the instruction in the microarchitecture.

© 2004 Wayne Wolf Levels of detail in simulation Instruction schedulers:  Models availability of microarchitectural resources.  May not capture all interactions. Cycle timers:  Models full microarchitecture.  Most accurate, requires exact model of the microarchitecture.

© 2004 Wayne Wolf Modular simulators Model instructions through a description file.  Drives assembler, basic behavioral simulation. Assemble a simulation program from code modules.  Can add your own code.

© 2004 Wayne Wolf Power simulation Model capacitance in the processor. Keep track of activity in the processor.  Requires full simulation. Activity determines capacitive charge/discharge, which determines power consumption.

© 2004 Wayne Wolf SimplePower simulator Cycle-accurate simulator.  SimpleScalar-style cycle-accurate simulator. Transition-based power analysis.  Estimates energy of data path, memory, and busses on every clock cycle.

© 2004 Wayne Wolf RTL power estimation interface A power estimator is required for each functional unit modeled in the simulator.  Functional interface makes the simulator more modular. Power estimator takes same arguments as the performance simulation module.

© 2004 Wayne Wolf Switch capacitance tables Model functional units such as ALU, register files, multiplexers, etc. Capture technology-dependent capacitance of the unit. Two types of model:  Bit-independent: each bit is independent, model is one bit wide.  Bit-dependent: bits interact (as in adder), model must be multiple bits. Analytical models used for memories. Adder model is built from sub-model for adder slice.

© 2004 Wayne Wolf Array model Analytical model:  Decoder.  Wordline drive.  Bitline discharge.  Sense amp output. Register file word line capacitance:  C diff (word line driver) + C gate (cell access)*n bit_lines + C metal * Word_line_length

© 2004 Wayne Wolf Clock network power model Clock is a major power sink in modern designs. Major elements of the clock power model:  Global clock lines.  Global drivers.  Loads on the clock network. Must handle gated clocks.

© 2004 Wayne Wolf Topics Goals of program performance anlaysis. Worst-case execution time analysis. Execution-based timing analysis.

Similar presentations

Presentation on theme: "© 2004 Wayne Wolf Topics Goals of program performance anlaysis. Worst-case execution time analysis. Execution-based timing analysis."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

© 2004 Wayne Wolf Topics Goals of program performance anlaysis. Worst-case execution time analysis. Execution-based timing analysis.

Similar presentations

Presentation on theme: "© 2004 Wayne Wolf Topics Goals of program performance anlaysis. Worst-case execution time analysis. Execution-based timing analysis."— Presentation transcript:

Similar presentations

About project

Feedback