Loop Optimizations Scheduling

Loop fusion
Before:
    int acc = 0;
    for (int i = 0; i < n; ++i) {
        acc += a[i];
        a[i] = acc;
    }
    for (int i = 0; i < n; ++i) {
        b[i] += a[i];
    }
After:
    int acc = 0;
    for (int i = 0; i < n; ++i) {
        acc += a[i];
        a[i] = acc;  // Will DCE pick this up?
        b[i] += acc;
    }

Loop fission
Before:
    for (int i = 0; i < n; ++i) {
        a[i] = e1;
        b[i] = e2;  // e1 and e2 independent
    }
After:
    for (int i = 0; i < n; ++i) {
        a[i] = e1;
    }
    for (int i = 0; i < n; ++i) {
        b[i] = e2;
    }

Loop unrolling
Before:
    for (int i = 0; i < n; ++i) {
        a[i] = b[i] * 7 + c[i] / 13;
    }
After (unrolled by 3; the first n % 3 iterations run in a remainder loop):
    int i = 0;  // declared outside so both loops share it
    for (; i < n % 3; ++i) {
        a[i] = b[i] * 7 + c[i] / 13;
    }
    for (; i < n; i += 3) {
        a[i]     = b[i]     * 7 + c[i]     / 13;
        a[i + 1] = b[i + 1] * 7 + c[i + 1] / 13;
        a[i + 2] = b[i + 2] * 7 + c[i + 2] / 13;
    }

Loop interchange
Before:
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            a[i][j] += 1;
        }
    }
After:
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < n; ++i) {
            a[i][j] += 1;
        }
    }

Loop peeling
Before:
    for (int i = 0; i < n; ++i) {
        b[i] = (i == 0) ? a[i] : a[i] + b[i-1];
    }
After (the first iteration is peeled off; this assumes n >= 1):
    b[0] = a[0];
    for (int i = 1; i < n; ++i) {
        b[i] = a[i] + b[i-1];
    }

Loop tiling
Before:
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            for (int k = 0; k < n; ++k) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
Very roughly (outer loops are still needed to move the tile origin y and z):
    for (int i = y; i < y + 10; ++i) {
        for (int j = z; j < z + 10; ++j) {
            for (int k = 0; k < n; ++k) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
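The tiled fragment above leaves out the outer loops that move y and z. A minimal sketch of the complete loop nest follows. The names `matmul_naive`, `matmul_tiled` and `min_int`, the `int` element type, and the sizes N = 25 and TILE = 10 are assumptions made for illustration, not part of the slides. The inner bounds are clamped so that N does not have to be a multiple of the tile size.

```c
enum { N = 25, TILE = 10 };  /* sizes are assumptions for this sketch */

/* Untiled triple loop, as on the slide. */
void matmul_naive(int c[N][N], int a[N][N], int b[N][N]) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < N; ++k)
                c[i][j] += a[i][k] * b[k][j];
}

static int min_int(int x, int y) { return x < y ? x : y; }

/* Tiled over i and j. The outer y/z loops walk the tile origins; the inner
 * three loops are the slide's fragment with clamped upper bounds. */
void matmul_tiled(int c[N][N], int a[N][N], int b[N][N]) {
    for (int y = 0; y < N; y += TILE) {
        for (int z = 0; z < N; z += TILE) {
            int i_end = min_int(y + TILE, N);  /* last tile may be partial */
            int j_end = min_int(z + TILE, N);
            for (int i = y; i < i_end; ++i)
                for (int j = z; j < j_end; ++j)
                    for (int k = 0; k < N; ++k)
                        c[i][j] += a[i][k] * b[k][j];
        }
    }
}
```

Both functions accumulate into c, so c should be zeroed first when a plain product is wanted. Only i and j are tiled here, as on the slide; the k loop still runs over its full range.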

Loop parallelization
Before:
    for (int i = 0; i < n; ++i) {
        a[i] = b[i] + c[i];  // a, b, and c do not overlap
    }
After:
    int i = 0;  // declared outside so both loops share it
    for (; i < n % 4; ++i)
        a[i] = b[i] + c[i];
    for (; i < n; i = i + 4) {
        __some4SIMDadd(a + i, b + i, c + i);
    }
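The slide's `__some4SIMDadd` is a placeholder for a 4-wide SIMD addition. The sketch below replaces it with a portable helper, here called `add4` (a hypothetical name), written as a fixed four-iteration loop that a vectorizing compiler may or may not turn into a single SIMD instruction. The `float` element type and the function name `vector_add` are also assumptions. The no-overlap requirement from the slide is expressed with `restrict`.

```c
/* Hypothetical stand-in for the slide's __some4SIMDadd: adds four adjacent
 * floats. Written in plain C so it builds on any target. */
static void add4(float *restrict dst, const float *restrict x,
                 const float *restrict y) {
    for (int lane = 0; lane < 4; ++lane)
        dst[lane] = x[lane] + y[lane];
}

/* a, b and c must not overlap (this is what restrict promises). */
void vector_add(float *restrict a, const float *restrict b,
                const float *restrict c, int n) {
    int i = 0;
    for (; i < n % 4; ++i)   /* scalar remainder first, as on the slide */
        a[i] = b[i] + c[i];
    for (; i < n; i += 4)    /* the remaining count is a multiple of 4 */
        add4(a + i, b + i, c + i);
}
```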

Instruction scheduling
An instruction goes through the processor pipeline in one or more cycles
Several instructions can be processed simultaneously at different stages in the pipeline
The number of cycles necessary to process an instruction is called its latency
Examples of instruction latency on some x86 processors
    – ADD: 1 cycle
    – MUL: 4 cycles
    – DIV (32 bits): 40 cycles
The simplest form of scheduling is done per CFG block

Instruction scheduling: example
Original order:
    1: ADD R1, R2
    2: MUL R3, R4
    3:
    4:
    5:
    6: ADD R1, R3
After scheduling:
    1: MUL R3, R4
    2: ADD R1, R2
    3:
    4:
    5: ADD R1, R3
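The cycle numbers in this example can be reproduced with a very small model. The sketch below is a simplification under stated assumptions: an in-order machine that issues at most one instruction per cycle, two-operand instructions whose destination is also a source, and a result that becomes available `latency` cycles after issue. The names `Instr`, `NUM_REGS` and `last_issue_cycle` are hypothetical, and real pipelines have hazards this model ignores.

```c
enum { NUM_REGS = 8 };  /* register count is an assumption for this sketch */

typedef struct {
    int dst;      /* destination register, also read (two-operand form) */
    int src;      /* second source register */
    int latency;  /* cycles from issue until the result is available */
} Instr;

/* Returns the 1-based cycle in which the last instruction of the block
 * issues, stalling whenever an operand is not yet available. */
int last_issue_cycle(const Instr *code, int count) {
    int ready[NUM_REGS] = {0};  /* first cycle each register can be read */
    int cycle = 0;              /* cycle of the previous issue */
    for (int k = 0; k < count; ++k) {
        int t = cycle + 1;      /* earliest free issue slot */
        if (ready[code[k].dst] > t) t = ready[code[k].dst];
        if (ready[code[k].src] > t) t = ready[code[k].src];
        ready[code[k].dst] = t + code[k].latency;
        cycle = t;
    }
    return cycle;
}
```

With ADD at latency 1 and MUL at latency 4, feeding the original order (ADD R1,R2; MUL R3,R4; ADD R1,R3) to this model gives a last issue in cycle 6, and the reordered block (MUL first) gives cycle 5, matching the two listings.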

Beyond blocks: trace scheduling 1

Beyond blocks: trace scheduling 2

Pipelining for loops
First idea: unroll the loop and then schedule
    – It works, but it is not always optimal, and it increases the code size
Think of a loop with the following body:
    – DIV R1, R3 ; ADD R1, R2
    – We would have to unroll 40 times to hide the latency
    – And in general, it may not always be possible to hide the latency
    – What if the DIV were computing the value needed 40 iterations from now?
Software pipelining
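To make the idea concrete at source level, here is a rough sketch of a DIV-then-ADD loop and a software-pipelined version of it. The function names, the array arguments and the pipeline depth of one iteration are assumptions chosen to keep the example short; hiding a 40-cycle divide would need a deeper pipeline, and a compiler is free to reorder either version. The divisor d is assumed to be nonzero.

```c
/* Plain loop: each ADD waits for the DIV issued just before it. */
void div_add_plain(int *out, const int *x, const int *y, int d, int n) {
    for (int i = 0; i < n; ++i) {
        int q = x[i] / d;    /* long-latency DIV */
        out[i] = q + y[i];   /* ADD depends on that DIV */
    }
}

/* Pipelined by one iteration: the DIV for iteration i + 1 is started
 * before the ADD for iteration i, so the two are independent. */
void div_add_pipelined(int *out, const int *x, const int *y, int d, int n) {
    if (n <= 0) return;
    int q = x[0] / d;                /* prologue: DIV of iteration 0 */
    for (int i = 0; i + 1 < n; ++i) {
        int q_next = x[i + 1] / d;   /* steady state: DIV of iteration i + 1 */
        out[i] = q + y[i];           /* ADD of iteration i */
        q = q_next;
    }
    out[n - 1] = q + y[n - 1];       /* epilogue: last ADD */
}
```

The three parts (prologue, steady state, epilogue) are the usual shape of a software-pipelined loop; the code size grows, but far less than with 40-fold unrolling.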

Software pipelining 1
"There is one last technique in the arsenal of the software optimizer that may be used to make most machines run at tip top speed. It can also lead to severe code bloat and may make for almost unreadable code, so should be considered the last refuge of the truly desperate. However, its performance characteristics are in many cases unmatched by any other approach, so we cover it here. It is called software pipelining [...]"
Apple Developer Connection

Software pipelining 2

Symbolic evaluation
Turns a sequence of instructions back into expressions
Hides some of the syntactic details
Example:
    add a b c ; add d a b ; add e d a
becomes
    a -> add(b,c)
    d -> add(add(b,c),b)
    e -> add(add(add(b,c),b),add(b,c))
For example, it is insensitive to the order of independent instructions
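The example above can be reproduced with a small evaluator. This is a minimal sketch, assuming single-letter variable names 'a' to 'z', three-address `add` instructions only, and a fixed expression buffer size; the names `AddInstr`, `symbolic_eval`, `NUM_VARS` and `MAX_EXPR` are hypothetical. Each variable starts out denoting itself, and every instruction rewrites its destination in terms of the current expressions of its sources.

```c
#include <stdio.h>
#include <string.h>

enum { NUM_VARS = 26, MAX_EXPR = 256 };  /* limits are assumptions */

typedef struct { char dst, src1, src2; } AddInstr;  /* add dst src1 src2 */

/* Fills expr[v - 'a'] with the symbolic value of variable v after the block.
 * Returns 0 on success, -1 if an expression does not fit in MAX_EXPR. */
int symbolic_eval(const AddInstr *code, int count,
                  char expr[NUM_VARS][MAX_EXPR]) {
    for (int v = 0; v < NUM_VARS; ++v) {  /* initially v denotes itself */
        expr[v][0] = (char)('a' + v);
        expr[v][1] = '\0';
    }
    for (int k = 0; k < count; ++k) {
        char result[MAX_EXPR];  /* temporary: dst may also be a source */
        int len = snprintf(result, sizeof result, "add(%s,%s)",
                           expr[code[k].src1 - 'a'],
                           expr[code[k].src2 - 'a']);
        if (len < 0 || len >= MAX_EXPR) return -1;
        strcpy(expr[code[k].dst - 'a'], result);
    }
    return 0;
}
```

Comparing the resulting strings for two blocks is one simple way to see the order insensitivity: swapping two independent instructions leaves every expression unchanged.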

Is my pipeline correct?
Let s(I) denote the symbolic evaluation of the block of instructions I
Let o be the composition of symbolic trees
If s(P o E) = s(B  ) and s(S o E) = s(E o B), then the pipeline is correct (the converse is not true)
