Program design and analysis

1 Program design and analysis
- Program-level performance analysis.
- Optimizing for:
  - Execution time.
  - Energy/power.
  - Program size.
- Program validation and testing.

2 Program-level performance analysis
- Need to understand performance in detail:
  - Real-time behavior, not just typical.
  - On complex platforms.
- Program performance ≠ CPU performance:
  - Pipeline, cache are windows into program.
  - We must analyze the entire program.

3 Complexities of program performance
- Varies with input data:
  - Different-length paths.
- Cache effects.
- Instruction-level performance variations:
  - Pipeline interlocks.
  - Fetch times.

4 How to measure program performance
- Simulate execution of the CPU.
  - Makes CPU state visible.
- Measure on real CPU using timer.
  - Requires modifying the program to control the timer.
- Measure on real CPU using logic analyzer.
  - Requires events visible on the pins.

5 Program performance metrics
- Average-case execution time.
  - Typically used in application programming.
- Worst-case execution time.
  - A component in deadline satisfaction.
- Best-case execution time.
  - Task-level interactions can cause best-case program behavior to result in worst-case system behavior.

6 Elements of program performance
- Basic program execution time formula:
  - execution time = program path + instruction timing
- Solving these problems independently helps simplify analysis.
  - Easier to separate on simpler CPUs.
- Accurate performance analysis requires:
  - Assembly/binary code.
  - Execution platform.

7 Data-dependent paths in an if statement
    if (a || b) {            /* T1 */
        if (c)               /* T2 */
            x = r*s+t;       /* A1 */
        else
            y = r+s;         /* A2 */
        z = r+s+u;           /* A3 */
    } else {
        if (c)               /* T3 */
            y = r-t;         /* A4 */
    }

a b c   path
0 0 0   T1=F, T3=F: no assignments
0 0 1   T1=F, T3=T: A4
0 1 0   T1=T, T2=F: A2, A3
0 1 1   T1=T, T2=T: A1, A3
1 0 0   T1=T, T2=F: A2, A3
1 0 1   T1=T, T2=T: A1, A3
1 1 0   T1=T, T2=F: A2, A3
1 1 1   T1=T, T2=T: A1, A3
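The path table can be checked mechanically. The following is a minimal sketch in C, assuming dummy operand values; the function names `trace_path` and `print_paths` and the bit-flag encoding of A1 to A4 are illustrative and do not appear in the slides.

```c
#include <stdio.h>

/* Bit flags recording which assignments ran (hypothetical encoding). */
enum { A1 = 1, A2 = 2, A3 = 4, A4 = 8 };

/* Runs the slide's if statement on dummy data and returns the set of
   assignments executed for the given values of a, b, and c. */
unsigned trace_path(int a, int b, int c)
{
    int r = 1, s = 2, t = 3, u = 4;   /* arbitrary operand values */
    int x = 0, y = 0, z = 0;
    unsigned path = 0;

    if (a || b) {                               /* T1 */
        if (c) { x = r * s + t; path |= A1; }   /* T2 true */
        else   { y = r + s;     path |= A2; }   /* T2 false */
        z = r + s + u;
        path |= A3;
    } else {
        if (c) { y = r - t;     path |= A4; }   /* T3 true */
    }
    (void)x; (void)y; (void)z;   /* results are not needed here */
    return path;
}

/* Prints one row per combination of a, b, c, as a hex mask of A1..A4. */
void print_paths(void)
{
    for (int v = 0; v < 8; v++) {
        int a = (v >> 2) & 1, b = (v >> 1) & 1, c = v & 1;
        printf("%d%d%d -> 0x%x\n", a, b, c, trace_path(a, b, c));
    }
}
```

Calling `print_paths()` lists the eight input combinations; each mask can be compared against the corresponding row of the table.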

8 Paths in a loop
    for (i=0, f=0; i<N; i++)
        f = f + c[i] * x[i];
[Flowchart: i=0, f=0; test i<N; if Y, execute f = f + c[i] * x[i], then i = i + 1, and return to the test; if N, exit the loop.]

9 Instruction timing
- Not all instructions take the same amount of time.
  - Multi-cycle instructions.
  - Fetches.
- Execution times of instructions are not independent.
  - Pipeline interlocks.
  - Cache effects.
- Execution times may vary with operand value.
  - Floating-point operations.
  - Some multi-cycle integer operations.

10 Measurement-driven performance analysis
- Not so easy as it sounds:
  - Must actually have access to the CPU.
  - Must know data inputs that give worst/best case performance.
  - Must make state visible.
- Still an important method for performance analysis.

11 Feeding the program
- Need to know the desired input values.
- May need to write software scaffolding to generate the input values.
- Software scaffolding may also need to examine outputs to generate feedback-driven inputs.

12 Trace-driven measurement
- Trace-driven:
  - Instrument the program.
  - Save information about the path.
- Requires modifying the program.
- Trace files are large.
- Widely used for cache analysis.

13 Physical measurement
- Program counter’s value
  - Start a timer when the program starts and stop the timer when the program stops.
  - Possible to modify the program.
- Logic analyzer can measure behavior at pins.
  - Address bus can be analyzed to look for events.
  - Code can be modified to make events visible.
- Particularly important for real-world input streams.
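The start-timer/stop-timer approach can be sketched on a hosted system with the standard C `clock()` function. This is only a stand-in: on an embedded target the two `clock()` calls would be replaced by reads of a hardware timer, and the function `workload` is a hypothetical placeholder for the program under test.

```c
#include <time.h>

/* Hypothetical workload standing in for the program being measured. */
long workload(long n)
{
    volatile long sum = 0;   /* volatile keeps the loop from being removed */
    for (long i = 0; i < n; i++)
        sum += i;
    return sum;
}

/* Returns the processor time, in seconds, consumed by workload(n). */
double time_workload(long n)
{
    clock_t start = clock();   /* start the timer */
    workload(n);
    clock_t stop = clock();    /* stop the timer */
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

The resolution of `clock()` is platform dependent, so short workloads may report zero; repeating the workload many times and dividing is a common workaround.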

14 CPU simulation
- Some simulators are less accurate.
- Cycle-accurate simulator provides accurate clock-cycle timing.
  - Simulator models CPU internals.
  - Simulator writer must know how CPU works.

15 SimpleScalar FIR filter simulation
    int x[N] = {8, 17, … };
    int c[N] = {1, 2, … };
    main() {
        int i, k, f;
        for (k=0; k<COUNT; k++)
            for (i=0; i<N; i++)
                f += c[i]*x[i];
    }

N        total sim cycles   sim cycles per filter execution
100      25854              259
1,000    155759             156
10,000   1451840            145

16 Performance optimization motivation
- Embedded systems must often meet deadlines.
  - Faster may not be fast enough.
- Need to be able to analyze execution time.
  - Worst-case, not typical.
- Need techniques for reliably improving execution time.

17 Programs and performance analysis
- Best results come from analyzing optimized instructions, not high-level language code:
  - non-obvious translations of HLL statements into instructions;
  - code may move;
  - cache effects are hard to predict.

18 Loop optimizations
- Loops are good targets for optimization.
- Basic loop optimizations:
  - code motion;
  - induction-variable elimination;
  - strength reduction (x*2 -> x<<1).

19 Code motion
    for (i=0; i<N*M; i++)
        z[i] = a[i] + b[i];
[Flowcharts: before, i=0; test i<N*M; if Y, z[i] = a[i] + b[i]; i = i+1; if N, exit. After, X = N*M; i=0; test i<X, with the same loop body.]
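A minimal sketch of the transformation in C follows. The function names `add_before` and `add_after` are illustrative; the variable `X` follows the slide's flowchart. An optimizing compiler will often perform this hoisting itself, so the benefit depends on the toolchain.

```c
/* Before: N*M appears in the loop test and may be recomputed on every
   iteration if the compiler does not hoist it. */
void add_before(int *z, const int *a, const int *b, int N, int M)
{
    for (int i = 0; i < N * M; i++)
        z[i] = a[i] + b[i];
}

/* After: the loop-invariant product is computed once, outside the loop. */
void add_after(int *z, const int *a, const int *b, int N, int M)
{
    int X = N * M;
    for (int i = 0; i < X; i++)
        z[i] = a[i] + b[i];
}
```

Both functions write the same values to `z`; only the number of multiplications in the loop test differs.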

20 Induction variable elimination
- Induction variable: loop index.
- Consider loop:
    for (i=0; i<N; i++)
        for (j=0; j<M; j++)
            z[i][j] = b[i][j];
- Rather than recompute i*M+j for each array in each iteration, share induction variable between arrays, increment at end of loop body.

21 Induction variable elimination (cont’d)
Before:
    for (i=0; i<N; i++)
        for (j=0; j<M; j++) {
            bz = i*M+j;
            *(zp+bz) = *(zb+bz);
        }
After:
    bz = 0;
    for (i=0; i<N; i++)
        for (j=0; j<M; j++) {
            *(zp+bz) = *(zb+bz);
            bz++;
        }
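The two versions can be compared directly. The sketch below wraps each in a function, assuming the arrays are stored row-major in flat `int` buffers; the function names `copy_indexed` and `copy_incremented` are illustrative, while `zp`, `zb`, and `bz` follow the slide.

```c
/* Copies an N x M row-major array, recomputing the index each iteration. */
void copy_indexed(int *zp, const int *zb, int N, int M)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++) {
            int bz = i * M + j;        /* one multiply and add per element */
            *(zp + bz) = *(zb + bz);
        }
}

/* Same copy with bz kept as a running index shared by both arrays. */
void copy_incremented(int *zp, const int *zb, int N, int M)
{
    int bz = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++) {
            *(zp + bz) = *(zb + bz);
            bz++;                      /* increment replaces the multiply */
        }
}
```

The second version works because the loops visit elements in exactly the order in which `i*M+j` increases by one.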

22 Strength reduction
- y = 2*x
- y = x<<1
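As a small sketch, the two forms can be written as C functions. The names `times_two_mul` and `times_two_shift` are illustrative, and `unsigned` is chosen here as an assumption so that the shift is well defined for every input (left-shifting a negative signed value is undefined behavior in C).

```c
/* Doubles x with a multiply. */
unsigned times_two_mul(unsigned x)
{
    return 2u * x;
}

/* Doubles x with a one-bit left shift, the strength-reduced form. */
unsigned times_two_shift(unsigned x)
{
    return x << 1;
}
```

Most optimizing compilers apply this rewrite automatically, so writing the shift by hand mainly matters when optimization is disabled or the compiler is limited.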

23 Cache analysis
- Loop nest: set of loops, one inside other.
- Perfect loop nest: no conditionals in nest.
- Because loops use large quantities of data, cache conflicts are common.

24 Array conflicts in cache
[Figure: a[0,0] at main-memory address 1024 and b[0,0] at address 4099, both mapped into the cache.]

25 Array conflicts, cont’d.
- Array elements conflict because they are in the same line, even if not mapped to same location.
- Solutions:
  - move one array;
  - pad array.
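A conflict of this kind can be illustrated with a small direct-mapped cache model. The slides do not state the cache geometry, so the line size and line count below are assumptions chosen so that addresses 1024 and 4099 land in the same line; the function names are illustrative.

```c
/* Assumed geometry of a direct-mapped cache (not given in the slides). */
#define LINE_SIZE 4u     /* addressable locations per cache line */
#define NUM_LINES 256u   /* number of lines in the cache */

/* Index of the cache line that an address maps to. */
unsigned cache_line(unsigned addr)
{
    return (addr / LINE_SIZE) % NUM_LINES;
}

/* Nonzero if the two addresses lie in different memory lines that
   compete for the same cache line. */
int conflicts(unsigned addr1, unsigned addr2)
{
    return (addr1 / LINE_SIZE) != (addr2 / LINE_SIZE)
        && cache_line(addr1) == cache_line(addr2);
}
```

Under these assumptions `conflicts(1024, 4099)` is true even though the two addresses have different offsets within the line. Padding the second array by one line, so that it starts at 4099 + LINE_SIZE, moves it to the next cache line and removes the conflict.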

26 Performance optimization hints
- Use registers efficiently.
- Use page mode memory accesses.
- Analyze cache behavior:
  - instruction conflicts can be handled by rewriting code, rescheduling;
  - conflicting scalar data can easily be moved;
  - conflicting array data can be moved, padded.

27 Energy/power optimization
- Energy: ability to do work.
  - Most important in battery-powered systems.
- Power: energy per unit time.
  - Important even in wall-plug systems: power becomes heat.

28 Measuring energy consumption
- Execute a small loop, measure current:
    while (TRUE)
        a();
[Figure: current I measured while the loop runs.]

29 Sources of energy consumption
- Relative energy per operation (Catthoor et al.):
  - memory transfer: 33
  - external I/O: 10
  - SRAM write: 9
  - SRAM read: 4.4
  - multiply: 3.6
  - add: 1

30 Cache behavior is important
- Energy consumption has a sweet spot as cache size changes:
  - cache too small: program thrashes, burning energy on external memory accesses;
  - cache too large: cache itself burns too much power.

31 Cache sweet spot
[Li98] © 1998 IEEE

32 Optimizing for energy
- First-order optimization:
  - high performance = low energy.
- Not many instructions trade speed for energy.

33 Optimizing for energy, cont’d.
- Use registers efficiently.
- Identify and eliminate cache conflicts.
- Moderate loop unrolling eliminates some loop overhead instructions.
- Eliminate pipeline stalls.
- Inlining procedures may help: reduces linkage, but may increase cache thrashing.
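Moderate loop unrolling can be sketched on the FIR-style dot product used earlier in the slides. The unroll factor of 4 and the function names `dot_rolled` and `dot_unrolled4` are assumptions made for illustration; the right factor depends on the CPU and cache.

```c
/* Straightforward loop: one test and one increment per element. */
int dot_rolled(const int *c, const int *x, int n)
{
    int f = 0;
    for (int i = 0; i < n; i++)
        f += c[i] * x[i];
    return f;
}

/* Unrolled by 4: one test and one increment per four elements,
   followed by a cleanup loop for the remaining 0 to 3 elements. */
int dot_unrolled4(const int *c, const int *x, int n)
{
    int f = 0;
    int i = 0;
    for (; i + 3 < n; i += 4) {
        f += c[i]     * x[i];
        f += c[i + 1] * x[i + 1];
        f += c[i + 2] * x[i + 2];
        f += c[i + 3] * x[i + 3];
    }
    for (; i < n; i++)
        f += c[i] * x[i];
    return f;
}
```

The unrolled body is larger, which is why the slide says "moderate": too much unrolling increases code size and can hurt instruction-cache behavior.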

34 Efficient loops
- General rules:
  - Don’t use function calls.
  - Keep loop body small to enable local repeat (only forward branches).
  - Use unsigned integer for loop counter.
  - Use <= to test loop counter.
  - Make use of compiler: global optimization, software pipelining.

35 Optimizing for program size
- Goal:
  - reduce hardware cost of memory;
  - reduce power consumption of memory units.
- Two opportunities:
  - data;
  - instructions.

36 Data size minimization
- Reuse constants, variables, data buffers in different parts of code.
  - Requires careful verification of correctness.
- Generate data using instructions.

37 Reducing code size
- Avoid function inlining.
- Choose CPU with compact instructions.
- Use specialized instructions where possible.

38 Program validation and testing
- But does it work?
- Concentrate here on functional verification.
- Major testing strategies:
  - Black box doesn’t look at the source code.
  - Clear box (white box) does look at the source code.

39 Clear-box testing
- Examine the source code to determine whether it works:
  - Can you actually exercise a path?
  - Do you get the value you expect along a path?
- Testing procedure:
  - Controllability: provide program with inputs.
  - Execute.
  - Observability: examine outputs.

40 Controlling and observing programs
    firout = 0.0;
    for (j=curr, k=0; j<N; j++, k++)
        firout += buff[j] * c[k];
    for (j=0; j<curr; j++, k++)
        firout += buff[j] * c[k];
    if (firout > 100.0) firout = 100.0;
    if (firout < -100.0) firout = -100.0;
- Controllability:
  - Must fill circular buffer with desired N values.
  - Other code governs how we access the buffer.
- Observability:
  - Want to examine firout before limit testing.
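One way to obtain both properties is to wrap the fragment in a function whose inputs are all parameters and which exposes the pre-limit value. The sketch below is an assumption about how such a test hook might look: the function name `fir_limited` and the optional `raw_out` pointer are not part of the slides.

```c
#include <stddef.h>

/* Computes the FIR output over a circular buffer of n samples whose
   oldest entry is at index curr, then limits it to [-100, 100].
   If raw_out is not NULL, it receives firout before limiting, which
   gives a test the observability the slide asks for. */
float fir_limited(const float *buff, const float *c, int n, int curr,
                  float *raw_out)
{
    float firout = 0.0f;
    int j, k;

    for (j = curr, k = 0; j < n; j++, k++)
        firout += buff[j] * c[k];
    for (j = 0; j < curr; j++, k++)
        firout += buff[j] * c[k];

    if (raw_out != NULL)
        *raw_out = firout;          /* value before limit testing */

    if (firout > 100.0f)  firout = 100.0f;
    if (firout < -100.0f) firout = -100.0f;
    return firout;
}
```

Passing `buff`, `c`, and `curr` as arguments addresses controllability: the test fills the buffer directly instead of depending on the code that normally advances it.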

41 Execution paths and testing
- Paths are important in functional testing as well as performance analysis.
- In general, an exponential number of paths through the program.
  - Show that some paths dominate others.
  - Heuristically limit paths.

42 Choosing the paths to test
- Possible criteria:
  - Execute every statement at least once.
  - Execute every branch direction at least once.
- Equivalent for structured programs.
- Not true for gotos.
[Figure label: "not covered".]

43 Basis paths
- Approximate CDFG with undirected graph.
- Undirected graphs have basis paths:
  - All paths are linear combinations of basis paths.

44 Cyclomatic complexity
- A bound on the size of basis sets:
  - e = # edges
  - n = # nodes
  - p = number of graph components
  - M = e - n + 2p.
- For a structured program:
  - Count the binary decisions in the flow graph and add 1.
  - A switch statement with more than two branches counts as (number of branches - 1) binary decisions.
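The formula is simple enough to state as code. The sketch below uses illustrative function names; the worked example assumes a flow graph for a single if/else with four nodes (decision, then-branch, else-branch, join) and four edges.

```c
/* Cyclomatic complexity from the graph: M = e - n + 2p. */
int cyclomatic_complexity(int edges, int nodes, int components)
{
    return edges - nodes + 2 * components;
}

/* Shortcut for a structured program: binary decisions plus one. */
int cyclomatic_from_decisions(int binary_decisions)
{
    return binary_decisions + 1;
}
```

For the single if/else, `cyclomatic_complexity(4, 4, 1)` gives 2, which matches `cyclomatic_from_decisions(1)`: one binary decision plus one.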

45 Branch testing
- Heuristic for testing branches.
  - Exercise true and false branches of conditional.
  - Exercise every simple condition at least once.

46 Branch testing example
- Correct:
    if (a || (b >= c)) { printf("OK\n"); }
- Incorrect:
    if (a && (b >= c)) { printf("OK\n"); }
- Test:
  - a = F
  - (b >= c) = T
- Example:
  - Correct: [0 || (3 >= 2)] = T
  - Incorrect: [0 && (3 >= 2)] = F
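The example can be turned into an executable check. In this sketch the two conditions are pulled out into functions so that a test can compare them; the names `cond_correct` and `cond_incorrect` are illustrative.

```c
/* The intended condition. */
int cond_correct(int a, int b, int c)
{
    return a || (b >= c);
}

/* The faulty condition, with && in place of ||. */
int cond_incorrect(int a, int b, int c)
{
    return a && (b >= c);
}
```

With a = 0, b = 3, c = 2 the two functions return different results, so that test vector exposes the bug. With a = 1, b = 3, c = 2 both return true, so that vector would not detect it, which is why every simple condition needs to be exercised.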

47 Another branch testing example
- Correct:
    if ((x == good_pointer) && (x->field1 == 3)) { printf("got the value\n"); }
- Incorrect:
    if ((x = good_pointer) && (x->field1 == 3)) { printf("got the value\n"); }
- Incorrect code changes pointer.
  - Assignment returns new LHS in C.
- Test that catches error:
  - (x != good_pointer) && (x->field1 == 3)

48 Domain testing
- Heuristic test for linear inequalities.
- Test on each side + boundary of inequality.

49 Def-use pairs
- Variable def-use:
  - Def when value is assigned (defined).
  - Use when used on right-hand side.
- Exercise each def-use pair.
  - Requires testing correct path.

50 Loop testing
- Loops need specialized tests to be tested efficiently.
- Heuristic testing strategy:
  - Skip loop entirely.
  - One loop iteration.
  - Two loop iterations.
  - # iterations much below max.
  - n-1, n, n+1 iterations where n is max.
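The heuristic can be turned into a small generator of iteration counts for a test harness. This is a sketch under assumptions: the function name `loop_test_counts` is illustrative, and "much below max" is arbitrarily taken to be n/4, which the slides do not specify.

```c
#include <stddef.h>

/* Fills counts[] with the iteration counts suggested by the loop-testing
   heuristic for a loop whose maximum iteration count is n.
   Returns the number of entries written (always 7). */
size_t loop_test_counts(int n, int counts[7])
{
    counts[0] = 0;       /* skip the loop entirely */
    counts[1] = 1;       /* one iteration */
    counts[2] = 2;       /* two iterations */
    counts[3] = n / 4;   /* assumption: "much below max" taken as n/4 */
    counts[4] = n - 1;
    counts[5] = n;
    counts[6] = n + 1;   /* one past the maximum */
    return 7;
}
```

A harness would run the loop under test once per entry; the n+1 case is expected to be rejected or clamped by correct code.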

51 Black-box testing
- Complements clear-box testing.
  - May require a large number of tests.
- Tests software in different ways.

52 Black-box test vectors
- Random tests.
  - May weight distribution based on software specification.
- Regression tests.
  - Tests of previous versions, bugs, etc.
  - May be clear-box tests of previous versions.

53 How much testing is enough?
- Exhaustive testing is impractical.
- One important measure of test quality: bugs escaping into the field.
- Good organizations can test software to give very low field bug report rates.
- Error injection measures test quality:
  - Add known bugs.
  - Run your tests.
  - Determine the percentage of injected bugs that are caught.

54 Homework
P181: 5-4(c), 5-5(e), 5-6(a), 5-8(b), 5-12, 5-13, 5-19(a)
© 2000 Morgan Kaufmann, Overheads for Computers as Components

