Loop Optimizations Scheduling

1 Loop Optimizations Scheduling

2 Loop fusion
Before:
    int acc = 0;
    for (int i = 0; i < n; ++i) {
        acc += a[i];
        a[i] = acc;
    }
    for (int i = 0; i < n; ++i) {
        b[i] += a[i];
    }
After:
    int acc = 0;
    for (int i = 0; i < n; ++i) {
        acc += a[i];
        a[i] = acc; // Will DCE pick this up?
        b[i] += acc;
    }

3 Loop fission
Before:
    for (int i = 0; i < n; ++i) {
        a[i] = e1;
        b[i] = e2; // e1 and e2 independent
    }
After:
    for (int i = 0; i < n; ++i) {
        a[i] = e1;
    }
    for (int i = 0; i < n; ++i) {
        b[i] = e2;
    }

4 Loop unrolling
Before:
    for (int i = 0; i < n; ++i) {
        a[i] = b[i] * 7 + c[i] / 13;
    }
After (unrolled by 3; a short remainder loop runs first):
    int i = 0; // declared outside so that both loops share it
    for (; i < n % 3; ++i) {
        a[i] = b[i] * 7 + c[i] / 13;
    }
    for (; i < n; i += 3) {
        a[i] = b[i] * 7 + c[i] / 13;
        a[i + 1] = b[i + 1] * 7 + c[i + 1] / 13;
        a[i + 2] = b[i + 2] * 7 + c[i + 2] / 13;
    }
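
The unrolled loop above can be checked against the rolled one. The following is a minimal sketch in C, assuming int arrays; the function names kernel_rolled and kernel_unrolled3 are hypothetical and do not come from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Reference version: one element per iteration. */
void kernel_rolled(int *a, const int *b, const int *c, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        a[i] = b[i] * 7 + c[i] / 13;
    }
}

/* Unrolled by 3: the first n % 3 elements are handled one at a time,
   so the main loop only ever sees full groups of three. */
void kernel_unrolled3(int *a, const int *b, const int *c, size_t n) {
    size_t i = 0; /* shared by both loops */
    for (; i < n % 3; ++i) {
        a[i] = b[i] * 7 + c[i] / 13;
    }
    assert((n - i) % 3 == 0); /* remaining trip count is a multiple of 3 */
    for (; i < n; i += 3) {
        a[i]     = b[i]     * 7 + c[i]     / 13;
        a[i + 1] = b[i + 1] * 7 + c[i + 1] / 13;
        a[i + 2] = b[i + 2] * 7 + c[i + 2] / 13;
    }
}
```

Running the remainder loop first keeps the main loop free of bounds checks inside its body; running it last would work equally well.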

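5 Loop interchange
Before:
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            a[i][j] += 1;
        }
    }
After:
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < n; ++i) {
            a[i][j] += 1;
        }
    }
Note: C arrays are row-major, so the i-outer version walks memory contiguously; which direction is profitable depends on the array layout.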

6 Loop peeling
Before:
    for (int i = 0; i < n; ++i) {
        b[i] = (i == 0) ? a[i] : a[i] + b[i-1];
    }
After:
    b[0] = a[0];
    for (int i = 1; i < n; ++i) {
        b[i] = a[i] + b[i-1];
    }

7 Loop tiling
Before:
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            for (int k = 0; k < n; ++k) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
Very roughly (outer loops are still needed to move y and z):
    for (int i = y; i < y + 10; ++i) {
        for (int j = z; j < z + 10; ++j) {
            for (int k = 0; k < n; ++k) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
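
The slide leaves out the outer loops that move y and z. One way to complete it is sketched below in C, under assumptions that are not in the slides: a fixed matrix size N, a tile size TILE that divides N evenly, and the hypothetical function names matmul_naive and matmul_tiled.

```c
#include <assert.h>
#include <stddef.h>

enum { N = 16, TILE = 4 }; /* hypothetical sizes; TILE must divide N */

/* Untiled triple loop, as on the slide. */
void matmul_naive(double c[N][N], double a[N][N], double b[N][N]) {
    for (size_t i = 0; i < N; ++i)
        for (size_t j = 0; j < N; ++j)
            for (size_t k = 0; k < N; ++k)
                c[i][j] += a[i][k] * b[k][j];
}

/* Tiled over i and j: y and z select the top-left corner of a
   TILE x TILE block of c, and the inner loops are the slide's loops. */
void matmul_tiled(double c[N][N], double a[N][N], double b[N][N]) {
    assert(N % TILE == 0);
    for (size_t y = 0; y < N; y += TILE)
        for (size_t z = 0; z < N; z += TILE)
            for (size_t i = y; i < y + TILE; ++i)
                for (size_t j = z; j < z + TILE; ++j)
                    for (size_t k = 0; k < N; ++k)
                        c[i][j] += a[i][k] * b[k][j];
}
```

Each c[i][j] still accumulates its products in the same k order, so the tiled result matches the untiled one exactly. Only the order in which the (i, j) pairs are visited changes, which is what improves cache reuse of a and b.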

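8 Loop parallelization
Before:
    for (int i = 0; i < n; ++i) {
        a[i] = b[i] + c[i]; // a, b, and c do not overlap
    }
After (4 elements per SIMD operation; a short remainder loop runs first):
    int i = 0; // declared outside so that both loops share it
    for (; i < n % 4; ++i)
        a[i] = b[i] + c[i];
    for (; i < n; i = i + 4) {
        __some4SIMDadd(a+i,b+i,c+i);
    }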
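
The slide's __some4SIMDadd is a placeholder rather than a real intrinsic. The sketch below keeps the same loop shape in portable C and replaces the placeholder with a hypothetical helper, add4, that adds four adjacent floats; a real implementation would use the target's SIMD instructions instead. The restrict qualifiers state the slide's assumption that a, b, and c do not overlap.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the slide's __some4SIMDadd:
   dst[0..3] = x[0..3] + y[0..3], written lane by lane. */
static void add4(float *dst, const float *x, const float *y) {
    for (int lane = 0; lane < 4; ++lane) {
        dst[lane] = x[lane] + y[lane];
    }
}

/* a[i] = b[i] + c[i], four elements at a time after a scalar remainder. */
void vector_add(float *restrict a, const float *restrict b,
                const float *restrict c, size_t n) {
    size_t i = 0;
    for (; i < n % 4; ++i) {
        a[i] = b[i] + c[i];
    }
    assert((n - i) % 4 == 0); /* only full groups of four remain */
    for (; i < n; i += 4) {
        add4(a + i, b + i, c + i);
    }
}
```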

9 Instruction scheduling
An instruction goes through the processor pipeline in one or more cycles
Several instructions can be processed simultaneously, at different stages of the pipeline
The number of cycles needed to process an instruction is called its latency
Examples of instruction latency on some x86 processors:
– ADD: 1 cycle
– MUL: 4 cycles
– DIV (32 bits): 40 cycles
The simplest form of scheduling is done per CFG block

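10 Instruction scheduling: example
Before scheduling (the numbers are cycles; empty slots are stalls while the final ADD waits for the MUL result):
    1: ADD R1, R2
    2: MUL R3, R4
    3:
    4:
    5:
    6: ADD R1, R3
After scheduling (the MUL is issued first, so one stall cycle disappears):
    1: MUL R3, R4
    2: ADD R1, R2
    3:
    4:
    5: ADD R1, R3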
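
The cycle counts in this example can be reproduced with a small model. The sketch below is written in C and assumes a very simplified machine that is not described in the slides: in-order, single-issue, with two-operand instructions (dst = dst op src) that stall until both operands are ready. The type Instr and the function last_issue_cycle are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

enum { NUM_REGS = 8 };

typedef struct {
    int dst;     /* register that is read and then written */
    int src;     /* register that is read */
    int latency; /* cycles until the result is available */
} Instr;

/* Returns the cycle in which the last instruction of the block issues. */
int last_issue_cycle(const Instr *code, size_t count) {
    int ready[NUM_REGS];
    for (int r = 0; r < NUM_REGS; ++r) {
        ready[r] = 1; /* every register is available from cycle 1 */
    }
    int cycle = 0;
    for (size_t k = 0; k < count; ++k) {
        assert(code[k].dst >= 0 && code[k].dst < NUM_REGS);
        assert(code[k].src >= 0 && code[k].src < NUM_REGS);
        int issue = cycle + 1; /* at most one instruction per cycle */
        if (ready[code[k].dst] > issue) issue = ready[code[k].dst];
        if (ready[code[k].src] > issue) issue = ready[code[k].src];
        ready[code[k].dst] = issue + code[k].latency;
        cycle = issue;
    }
    return cycle;
}
```

With ADD at latency 1 and MUL at latency 4, the original order {ADD R1,R2; MUL R3,R4; ADD R1,R3} issues its last instruction in cycle 6, and the reordered block {MUL R3,R4; ADD R1,R2; ADD R1,R3} issues it in cycle 5, as in the example.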

11 Beyond blocks: trace scheduling 1

12 Beyond blocks: trace scheduling 2

13 Pipelining for loops
First idea: unroll the loop, then schedule
– It works, but it is not always optimal, and it increases the code size
Think of a loop with the following body:
– DIV R1, R3 ; ADD R1, R2
– We would have to unroll 40 times to hide the latency
– And in general, it may not always be possible to hide the latency
– What if the DIV were computing the value needed 40 iterations from now?
Software pipelining
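
The idea of starting the DIV for a later iteration can be sketched at source level. The C code below is only an illustration under assumptions that are not in the slides: the loop computes out[i] = x[i] / d + y[i], the arrays do not overlap, and the pipeline distance is one iteration rather than 40. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop: each ADD waits for the DIV of the same iteration. */
void divide_add_plain(int *out, const int *x, const int *y, int d, size_t n) {
    assert(d != 0);
    for (size_t i = 0; i < n; ++i) {
        out[i] = x[i] / d + y[i];
    }
}

/* Software-pipelined by one iteration: the steady state starts the
   division for iteration i + 1 while finishing the addition for i. */
void divide_add_pipelined(int *out, const int *x, const int *y, int d, size_t n) {
    assert(d != 0);
    if (n == 0) {
        return;
    }
    int q = x[0] / d;                    /* prologue: DIV of iteration 0 */
    for (size_t i = 0; i + 1 < n; ++i) { /* steady state */
        int q_next = x[i + 1] / d;       /* DIV of iteration i + 1 */
        out[i] = q + y[i];               /* ADD of iteration i */
        q = q_next;
    }
    out[n - 1] = q + y[n - 1];           /* epilogue: ADD of the last iteration */
}
```

The prologue, steady state, and epilogue are the three pieces a software pipeline is built from. The code grows, but the ADD in the steady state no longer depends on a DIV issued in the same iteration.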

14 Software pipelining 1
"There is one last technique in the arsenal of the software optimizer that may be used to make most machines run at tip top speed. It can also lead to severe code bloat and may make for almost unreadable code, so should be considered the last refuge of the truly desperate. However, its performance characteristics are in many cases unmatched by any other approach, so we cover it here. It is called software pipelining [...]"
(Apple Developer Connection)

15 Software pipelining 2

16 Symbolic evaluation
Turns a sequence of instructions back into expressions
Hides some of the syntactic details
Example:
    add a b c ; add d a b ; add e d a
becomes
    a -> add(b,c)
    d -> add(add(b,c),b)
    e -> add(add(add(b,c),b),add(b,c))
For example, it is insensitive to the order of independent instructions
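
The example above can be reproduced with a small evaluator. The C sketch below assumes a toy setting that goes beyond the slides: variables are single lowercase letters, the only instruction is a three-operand add, and expressions are kept as strings. The names AddInstr, SymState, sym_init, and sym_eval are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum { NUM_VARS = 26, MAX_EXPR = 256 };

typedef struct { char dst, lhs, rhs; } AddInstr; /* add dst lhs rhs */

typedef struct { char expr[NUM_VARS][MAX_EXPR]; } SymState;

static int var_index(char v) {
    assert(v >= 'a' && v <= 'z');
    return v - 'a';
}

/* Initially every variable stands for itself. */
void sym_init(SymState *s) {
    for (int v = 0; v < NUM_VARS; ++v) {
        s->expr[v][0] = (char)('a' + v);
        s->expr[v][1] = '\0';
    }
}

/* Replaces each destination by the expression built from its operands. */
void sym_eval(SymState *s, const AddInstr *code, size_t count) {
    for (size_t k = 0; k < count; ++k) {
        char tmp[MAX_EXPR]; /* needed because dst may also be an operand */
        int len = snprintf(tmp, sizeof tmp, "add(%s,%s)",
                           s->expr[var_index(code[k].lhs)],
                           s->expr[var_index(code[k].rhs)]);
        assert(len >= 0 && (size_t)len < sizeof tmp);
        strcpy(s->expr[var_index(code[k].dst)], tmp);
    }
}
```

Because the state records only the final expression of each variable, swapping two independent instructions leaves it unchanged, which is the insensitivity mentioned above.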

17 Is my pipeline correct?
Let s(I) denote the symbolic evaluation of the block of instructions I
Let o be the composition of symbolic trees
If s(P o E) = s(B  ) and s(S o E) = s(E o B), then the pipeline is correct (the converse is not true)

