Presentation is loading. Please wait.

Presentation is loading. Please wait.

Optimizing program performance

Similar presentations


Presentation on theme: "Optimizing program performance"— Presentation transcript:

1 Optimizing program performance
Computer Systems Optimizing program performance Arnoud Visser Computer Systems – optimizing program performance

2 Performance can make the difference
Use Pointers instead of array indices Use doubles instead of floats Optimize inner loops Recommendations Patrick van der Smagt in 1991 for neural net implementations Arnoud Visser Computer Systems – optimizing program performance

3 Machine-independent versus Machine-dependent optimizations
Optimizations you should do regardless of processor / compiler Code Motion (out of the loop) Reducing procedure calls Unneeded Memory usage Share Common sub-expressions Machine-Dependent Optimizations Pointer code Unrolling Enabling instruction level parallelism Arnoud Visser Computer Systems – optimizing program performance

4 Computer Systems – optimizing program performance
Machine dependent One has to known today’s architectures Superscalar (Pentium) (often two instructions/cycle) Dynamic execution (P6) (three instructions out-of-order/cycle) Explicit parallelism (Itanium) (six execution units) Arnoud Visser Computer Systems – optimizing program performance

5 Computer Systems – optimizing program performance
Pentium III Design Instruction Control Address Fetch Control Instruction Cache Retirement Unit Instrs. Register File Instruction Decode Operations Register Updates Prediction OK? Execution Functional Units Integer/ Branch General Integer FP Add FP Mult/Div Load Store Operation Results Addr. Addr. Data Data Data Cache Arnoud Visser Computer Systems – optimizing program performance

6 Functional Units of Pentium III
Multiple Instructions Can Execute in Parallel 1 load 1 store 2 integer (one may be branch) 1 FP Addition 1 FP Multiplication or Division Arnoud Visser Computer Systems – optimizing program performance

7 Performance of Pentium III operations
Many instructions can be Pipelined to 1 cycle Instruction Latency Cycles/Issue Load / Store Integer Multiply Integer Divide Double/Single FP Multiply Double/Single FP Add Double/Single FP Divide Arnoud Visser Computer Systems – optimizing program performance

8 Computer Systems – optimizing program performance
Instruction Control Instruction Control Instruction Cache Fetch Control Decode Address Instrs. Operations Retirement Unit Register File Grabs Instructions From Memory Based on current PC + predicted targets for predicted branches Hardware dynamically guesses (possibly) branch target Translates Instructions Into Operations Primitive steps required to perform instruction Typical instruction requires 1–3 operations Converts Register References Into Tags Abstract identifier linking destination of one operation with sources of later operations Arnoud Visser Computer Systems – optimizing program performance

9 Computer Systems – optimizing program performance
Translation Example Version of Combine4 Integer data, multiply operation Translation of First Iteration .L24: # Loop: imull (%eax,%edx,4),%ecx # t *= data[i] incl %edx # i++ cmpl %esi,%edx # i:length jl .L24 # if < goto Loop .L24: imull (%eax,%edx,4),%ecx incl %edx cmpl %esi,%edx jl .L24 Split into two operations load reads from memory to generate temporary result t.1 Multiply operation just operates on registers Operands Registers %eax does not change in loop. Values will be retrieved from register file during decoding Register %ecx changes on every iteration. Uniquely identify different versions as %ecx.0, %ecx.1, %ecx.2, … Register renaming Values passed directly from producer to consumers load (%eax,%edx.0,4)  t.1 imull t.1, %ecx.0  %ecx.1 incl %edx.0  %edx.1 cmpl %esi, %edx.1  cc.1 jl-taken cc.1 Arnoud Visser Computer Systems – optimizing program performance

10 Visualizing Operations
load (%eax,%edx,4)  t.1 imull t.1, %ecx.0  %ecx.1 incl %edx.0  %edx.1 cmpl %esi, %edx.1  cc.1 jl-taken cc.1 cc.1 t.1 load %ecx.1 incl cmpl jl %edx.0 %edx.1 %ecx.0 imull Time Operations Vertical position denotes time at which executed Cannot begin operation until operands available Height denotes latency Operands Arcs shown only for operands that are passed within execution unit Arnoud Visser Computer Systems – optimizing program performance

11 4 Iterations of Combining Sum
4 integer ops Resource Analysis Performance Unlimited resources should give CPE of 1.0 Would require executing 4 integer operations in parallel Arnoud Visser Computer Systems – optimizing program performance

12 Pentium Resource Constraints
Only two integer functional units Set priority based on program order Performance Sustain CPE of 2.0 Arnoud Visser Computer Systems – optimizing program performance

13 Computer Systems – optimizing program performance
Loop Unrolling Measured CPE=1.33 void combine5(vec_ptr v, int *dest) { int length = vec_length(v); int limit = length-2; int *data = get_vec_start(v); int sum = 0; int i; /* Combine 3 elements at a time */ for (i = 0; i < limit; i+=3) { sum += data[i] + data[i+2] + data[i+1]; } /* Finish any remaining elements */ for (; i < length; i++) { sum += data[i]; *dest = sum; Optimization Combine multiple iterations into single loop body Amortizes loop overhead across multiple iterations Finish extras at end Arnoud Visser Computer Systems – optimizing program performance

14 Resource distribution with Loop Unrolling
Predicted Performance Can complete iteration in 3 cycles Should give CPE of 1.0 Arnoud Visser Computer Systems – optimizing program performance

15 Computer Systems – optimizing program performance
Effect of Unrolling Unrolling Degree 1 2 3 4 8 16 Integer Sum 2.00 1.50 1.33 1.25 1.06 Product 4.00 FP 3.00 5.00 Only helps integer sum for our examples Other cases constrained by functional unit latencies Effect is nonlinear with degree of unrolling Many subtle effects determine exact scheduling of operations Arnoud Visser Computer Systems – optimizing program performance

16 Unrolling is for long vectors
Unrolling Degree 1 2 3 4 8 16 IntSum 2.00 1.50 1.33 1.25 1.06 1024 2.06 1.56 1.40 1.31 1.12 31 4.02 3.57 3.39 3.84 3.91 3.66 New source of overhead The need to finish the remaining elements when the vector length is not divisible by the degree of unrolling Arnoud Visser Computer Systems – optimizing program performance

17 3 Iterations of Combining Product
Unlimited Resource Analysis Assume operation can start as soon as operands available Performance Limiting factor becomes latency Gives CPE of 4.0 Arnoud Visser Computer Systems – optimizing program performance

18 Computer Systems – optimizing program performance
Iteration splitting void combine6(vec_ptr v, int *dest) { int length = vec_length(v); int limit = length-1; int *data = get_vec_start(v); int x0 = 1; int x1 = 1; int i; /* Combine 2 elements at a time */ for (i = 0; i < limit; i+=2) { x0 *= data[i]; x1 *= data[i+1]; } /* Finish any remaining elements */ for (; i < length; i++) { *dest = x0 * x1; Optimization Make operands available by accumulating in two different products (x0,x1) Combine at end Performance CPE = 2.0 Arnoud Visser Computer Systems – optimizing program performance

19 Resource distribution with Iteration Splitting
Vraag: wat klopt hier niet: derde jl geeft conflict met eerste imull Predicted Performance Make use of both execution units Gives CPE of 2.0 Arnoud Visser Computer Systems – optimizing program performance

20 Results for Pentium III
Biggest gain doing basic optimizations But, last little bit helps Arnoud Visser Computer Systems – optimizing program performance

21 Computer Systems – optimizing program performance
Results for Pentium 4 Higher latencies (int * = 14, fp + = 5.0, fp * = 7.0) Clock runs at 2.0 GHz Not an improvement over 1.0 GHz P3 for integer * Avoids FP multiplication anomaly Arnoud Visser Computer Systems – optimizing program performance

22 Computer Systems – optimizing program performance
Machine-Dependent Opt. Summary Loop Unrolling Some compilers do this automatically Generally not as clever as what can achieve by hand Exposing Instruction-Level Parallelism Very machine dependent Warning: Benefits depend heavily on particular machine Do only for performance-critical parts of code Best if performed by compiler But GCC on IA32/Linux is not very good Pointer Code Look carefully at generated code to see whether helpful Arnoud Visser Computer Systems – optimizing program performance

23 Computer Systems – optimizing program performance
Conclusion How should I write my programs, given that I have a good, optimizing compiler? Don’t: Smash Code into Oblivion Hard to read, maintain, & assure correctness Do: Select best algorithm & data representation Write code that’s readable & maintainable Procedures, recursion, without built-in constant limits Even though these factors can slow down code Focus on Inner Loops Detailed optimization means detailed measurement Arnoud Visser Computer Systems – optimizing program performance

24 Computer Systems – optimizing program performance
Assignment Practice Problems Practice Problem 5.5: 'What program variable has been spilled for combine6()?‘ Practice Problem 5.6: 'Indicate the CPE of different associated products.' Optimization Lab Arnoud Visser Computer Systems – optimizing program performance

25 Performance of Pentium Core 2 operations
Two stores and one load in a single cycle. The performance of IDIV depends on the quotient Instruction Latency Cycles/Issue Load / Store Integer Multiply Integer Divide Double/Single FP Multiply Double/Single FP Add Double/Single FP Divide Table C13a en C14a (Pentium Core 2 = cpuid 06_0FH) The latency and throughput of IDIV in Intel Core microarchitecture varies with the number of significant digits of the quotient of the division. Latency and throughput of IDIV may increases with the number of significant digit of the quotient. The latency and throughput of IDIV with 64-bit operand are significantly slower than those with 32-bit operand Arnoud Visser Computer Systems – optimizing program performance


Download ppt "Optimizing program performance"

Similar presentations


Ads by Google