Published by August Newman. Modified over 9 years ago.
1
Loop Optimizations Scheduling
2
Loop fusion

Before:
    int acc = 0;
    for (int i = 0; i < n; ++i) {
        acc += a[i];
        a[i] = acc;
    }
    for (int i = 0; i < n; ++i) {
        b[i] += a[i];
    }

After:
    int acc = 0;
    for (int i = 0; i < n; ++i) {
        acc += a[i];
        a[i] = acc; // Will DCE pick this up?
        b[i] += acc;
    }
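A minimal sketch of the fusion above as two compilable C functions, so the two versions can be compared on the same input. The function names are hypothetical, and the sketch assumes that a and b do not overlap.

```c
#include <stddef.h>

/* Unfused: one pass builds the running sum in a, a second pass adds it to b. */
void scan_add_unfused(int *a, int *b, size_t n) {
    int acc = 0;
    for (size_t i = 0; i < n; ++i) {
        acc += a[i];
        a[i] = acc;
    }
    for (size_t i = 0; i < n; ++i) {
        b[i] += a[i];
    }
}

/* Fused: a single pass; b[i] uses acc directly instead of re-reading a[i]. */
void scan_add_fused(int *a, int *b, size_t n) {
    int acc = 0;
    for (size_t i = 0; i < n; ++i) {
        acc += a[i];
        a[i] = acc;
        b[i] += acc;
    }
}
```

Running both on copies of the same arrays should leave a and b identical; the fused version touches each element of a only once.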
3
Loop fission

Before:
    for (int i = 0; i < n; ++i) {
        a[i] = e1;
        b[i] = e2; // e1 and e2 independent
    }

After:
    for (int i = 0; i < n; ++i) {
        a[i] = e1;
    }
    for (int i = 0; i < n; ++i) {
        b[i] = e2;
    }
4
Loop unrolling

Before:
    for (int i = 0; i < n; ++i) {
        a[i] = b[i] * 7 + c[i] / 13;
    }

After (unrolled 3 times; the n % 3 leftover iterations run first):
    int i = 0; // declared outside so that both loops share it
    for (; i < n % 3; ++i) {
        a[i] = b[i] * 7 + c[i] / 13;
    }
    for (; i < n; i += 3) {
        a[i] = b[i] * 7 + c[i] / 13;
        a[i + 1] = b[i + 1] * 7 + c[i + 1] / 13;
        a[i + 2] = b[i + 2] * 7 + c[i + 2] / 13;
    }
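The unrolled form can be checked against the rolled one for every remainder of n modulo 3. The sketch below wraps both in hypothetical functions (kernel_rolled and kernel_unrolled3 are names chosen here, not from the slides) and assumes the arrays do not overlap.

```c
#include <stddef.h>

/* Reference version: one element per iteration. */
void kernel_rolled(int *a, const int *b, const int *c, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        a[i] = b[i] * 7 + c[i] / 13;
    }
}

/* Unrolled by 3: the first n % 3 elements are handled one at a time,
   so the remaining count is a multiple of 3 and i + 2 stays in bounds. */
void kernel_unrolled3(int *a, const int *b, const int *c, size_t n) {
    size_t i = 0;
    for (; i < n % 3; ++i) {
        a[i] = b[i] * 7 + c[i] / 13;
    }
    for (; i < n; i += 3) {
        a[i]     = b[i]     * 7 + c[i]     / 13;
        a[i + 1] = b[i + 1] * 7 + c[i + 1] / 13;
        a[i + 2] = b[i + 2] * 7 + c[i + 2] / 13;
    }
}
```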
5
Loop interchange

Before:
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            a[i][j] += 1;
        }
    }

After:
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < n; ++i) {
            a[i][j] += 1;
        }
    }
6
Loop peeling

Before:
    for (int i = 0; i < n; ++i) {
        b[i] = (i == 0) ? a[i] : a[i] + b[i-1];
    }

After (the first iteration is peeled off; this assumes n >= 1):
    b[0] = a[0];
    for (int i = 1; i < n; ++i) {
        b[i] = a[i] + b[i-1];
    }
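As a sketch, the peeled loop can be wrapped in a function that also covers the empty case, which the peeled form on the slide does not guard against. The names running_sum_branchy and running_sum_peeled are hypothetical.

```c
#include <stddef.h>

/* Original form: the i == 0 test is evaluated on every iteration. */
void running_sum_branchy(int *b, const int *a, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        b[i] = (i == 0) ? a[i] : a[i] + b[i - 1];
    }
}

/* Peeled form: iteration 0 is executed before the loop, so the loop body
   is branch-free. The n == 0 guard keeps b[0] untouched for empty input. */
void running_sum_peeled(int *b, const int *a, size_t n) {
    if (n == 0) {
        return;
    }
    b[0] = a[0];
    for (size_t i = 1; i < n; ++i) {
        b[i] = a[i] + b[i - 1];
    }
}
```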
7
Loop tiling

Before:
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            for (int k = 0; k < n; ++k) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }

Very roughly (outer loops are still needed to move y and z):
    for (int i = y; i < y + 10; ++i) {
        for (int j = z; j < z + 10; ++j) {
            for (int k = 0; k < n; ++k) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
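One possible way to fill in the missing outer loops over y and z is sketched below. It assumes square n x n matrices stored row-major in flat arrays, keeps the tile size of 10 from the slide, and clamps the tile bounds so that n need not be a multiple of 10. The names matmul_naive, matmul_tiled, TILE and min_sz are choices made for this sketch.

```c
#include <stddef.h>

enum { TILE = 10 };

static size_t min_sz(size_t x, size_t y) { return x < y ? x : y; }

/* Reference: plain triple loop, c += a * b. */
void matmul_naive(double *c, const double *a, const double *b, size_t n) {
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            for (size_t k = 0; k < n; ++k)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

/* Tiled over i and j: y and z are the top-left corner of the current tile. */
void matmul_tiled(double *c, const double *a, const double *b, size_t n) {
    for (size_t y = 0; y < n; y += TILE) {
        for (size_t z = 0; z < n; z += TILE) {
            size_t i_end = min_sz(y + TILE, n); /* clamp the last partial tile */
            size_t j_end = min_sz(z + TILE, n);
            for (size_t i = y; i < i_end; ++i)
                for (size_t j = z; j < j_end; ++j)
                    for (size_t k = 0; k < n; ++k)
                        c[i * n + j] += a[i * n + k] * b[k * n + j];
        }
    }
}
```

Each c[i][j] still accumulates its products in the same k order, so the tiled version computes exactly the same values; only the order in which the (i, j) pairs are visited changes.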
8
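Loop parallelization

Before:
    for (int i = 0; i < n; ++i) {
        a[i] = b[i] + c[i]; // a, b, and c do not overlap
    }

After (4 elements per SIMD operation; the n % 4 leftover iterations run first):
    int i = 0; // declared outside so that both loops share it
    for (; i < n % 4; ++i)
        a[i] = b[i] + c[i];
    for (; i < n; i = i + 4) {
        __some4SIMDadd(a+i, b+i, c+i);
    }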
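The slide's __some4SIMDadd is a placeholder, not a real intrinsic. The sketch below substitutes a portable scalar stand-in, add4, so that the strip-mined loop structure can be compiled and tested anywhere; on real hardware that helper would be replaced by a 4-wide vector add. Both function names are hypothetical, and the arrays are assumed not to overlap.

```c
#include <stddef.h>

/* Scalar stand-in for a 4-wide SIMD add: dst[0..3] = x[0..3] + y[0..3]. */
static void add4(float *dst, const float *x, const float *y) {
    for (size_t lane = 0; lane < 4; ++lane) {
        dst[lane] = x[lane] + y[lane];
    }
}

/* The first n % 4 elements are handled one by one; the rest in groups of 4. */
void vec_add(float *a, const float *b, const float *c, size_t n) {
    size_t i = 0;
    for (; i < n % 4; ++i) {
        a[i] = b[i] + c[i];
    }
    for (; i < n; i += 4) {
        add4(a + i, b + i, c + i);
    }
}
```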
9
Instruction scheduling
An instruction goes through the processor pipeline in one or more cycles
Several instructions can be processed simultaneously at different stages in the pipeline
The number of cycles necessary to process an instruction is called its latency
Examples of instruction latency on some x86 processors:
  – ADD: 1 cycle
  – MUL: 4 cycles
  – DIV (32 bits): 40 cycles
The simplest form of scheduling is done per CFG block
10
Instruction scheduling: example

Original order:
    1: ADD R1, R2
    2: MUL R3, R4
    3:
    4:
    5:
    6: ADD R1, R3

After scheduling:
    1: MUL R3, R4
    2: ADD R1, R2
    3:
    4:
    5: ADD R1, R3
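The cycle numbers in this example can be reproduced with a very small model. The sketch below assumes an in-order machine that issues at most one instruction per cycle and stalls an instruction until both of its operands are ready, with x86-style two-operand instructions (the destination is also a source). The Insn type and last_issue_cycle function are hypothetical names for this sketch, not part of any real tool.

```c
#include <stddef.h>

enum { NUM_REGS = 8 };

typedef struct {
    int latency; /* cycles until the result becomes available */
    int dst;     /* destination register, also read as first operand */
    int src;     /* second operand register */
} Insn;

/* Returns the cycle in which the last instruction of the block issues. */
int last_issue_cycle(const Insn *code, size_t count) {
    int ready[NUM_REGS] = {0}; /* cycle from which each register is usable */
    int cycle = 0;             /* issue cycle of the previous instruction */
    for (size_t i = 0; i < count; ++i) {
        int issue = cycle + 1; /* at most one instruction per cycle */
        if (ready[code[i].dst] > issue) issue = ready[code[i].dst];
        if (ready[code[i].src] > issue) issue = ready[code[i].src];
        ready[code[i].dst] = issue + code[i].latency;
        cycle = issue;
    }
    return cycle;
}
```

With ADD at latency 1 and MUL at latency 4, the original order ADD R1, R2; MUL R3, R4; ADD R1, R3 issues its last instruction in cycle 6, while the order that starts with the MUL issues it in cycle 5.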
11
Beyond blocks: trace scheduling 1
12
Beyond blocks: trace scheduling 2
13
Pipelining for loops
First idea: unroll the loop and then schedule
  – It works, but it is not always optimal, and it increases the code size
Think of a loop with the following body:
  – DIV R1, R3 ; ADD R1, R2
  – We would have to unroll 40 times to hide the latency
  – And in general, it may not always be possible to hide the latency
  – What if the DIV was computing the value for 40 iterations from now?
Software pipelining
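The idea of letting the DIV work for a later iteration can be sketched at source level. The example below uses a distance of one iteration rather than 40, purely to keep it short: a prologue starts the first division, the steady-state loop starts the division for iteration i + 1 while finishing iteration i, and an epilogue finishes the last iteration. The function names and the arrays x, y, z are hypothetical, and d is assumed to be nonzero.

```c
#include <stddef.h>

/* Original loop: each ADD waits for the DIV of the same iteration. */
void div_add_plain(int *y, const int *x, const int *z, int d, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        int t = x[i] / d;
        y[i] = t + z[i];
    }
}

/* Software-pipelined loop with an iteration distance of 1. */
void div_add_pipelined(int *y, const int *x, const int *z, int d, size_t n) {
    if (n == 0) {
        return;
    }
    int t = x[0] / d;                 /* prologue: first division */
    for (size_t i = 0; i + 1 < n; ++i) {
        int next = x[i + 1] / d;      /* start the division for iteration i + 1 */
        y[i] = t + z[i];              /* finish iteration i; independent of next */
        t = next;
    }
    y[n - 1] = t + z[n - 1];          /* epilogue: finish the last iteration */
}
```

Inside the steady-state loop the division and the addition no longer depend on each other, so a scheduler is free to overlap them.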
14
Software pipelining 1
"There is one last technique in the arsenal of the software optimizer that may be used to make most machines run at tip top speed. It can also lead to severe code bloat and may make for almost unreadable code, so should be considered the last refuge of the truly desperate. However, its performance characteristics are in many cases unmatched by any other approach, so we cover it here. It is called software pipelining [... ]"
(Apple Developer Connection)
15
Software pipelining 2
16
Symbolic evaluation
Turns a sequence of instructions back into expressions
Hides some of the syntactic details
Example:
    add a b c ; add d a b ; add e d a
becomes
    a -> add(b,c)
    d -> add(add(b,c),b)
    e -> add(add(add(b,c),b),add(b,c))
For example, it is insensitive to the order of independent instructions
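A minimal sketch of such an evaluator, restricted to the three-address add instruction used in the example. It assumes single-letter variable names from a to z and represents each symbolic expression as a string; the AddInsn type, symbolic_eval function and EXPR_MAX limit are hypothetical choices for this sketch.

```c
#include <stdio.h>
#include <string.h>

enum { EXPR_MAX = 256 };

/* "add dst src1 src2" with single-letter variable names. */
typedef struct { char dst, src1, src2; } AddInsn;

/* After the call, env[v - 'a'] holds the symbolic expression of variable v.
   A variable that was never assigned is a free input and maps to its own
   name. Expressions longer than EXPR_MAX - 1 characters are truncated. */
void symbolic_eval(const AddInsn *code, size_t count, char env[26][EXPR_MAX]) {
    for (int v = 0; v < 26; ++v) {
        env[v][0] = (char)('a' + v);
        env[v][1] = '\0';
    }
    for (size_t i = 0; i < count; ++i) {
        char result[EXPR_MAX];
        /* Build the new expression from the current ones of both sources. */
        snprintf(result, sizeof result, "add(%s,%s)",
                 env[code[i].src1 - 'a'], env[code[i].src2 - 'a']);
        strcpy(env[code[i].dst - 'a'], result);
    }
}
```

Applied to add a b c ; add d a b ; add e d a, this yields the three expressions listed above. Swapping two instructions that do not depend on each other leaves every resulting expression unchanged.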
17
Is my pipeline correct?
Let s(I) denote the symbolic evaluation of the block of instructions I
Let ∘ be the composition of symbolic trees
If s(P ∘ E) = s(B) and s(S ∘ E) = s(E ∘ B), then the pipeline is correct (the converse is not true)