CMPUT 329 - Computer Organization and Architecture II1 CMPUT680 - Winter 2006 Topic I: Superblock and Hyperblock Formation José Nelson Amaral

http://www.cs.ualberta.ca/~amaral/courses/680

Instruction Level Parallelism Optimizations

The objective of an optimizer is to reduce the number and complexity of the instructions executed by the processor. Superscalar and Very Long Instruction Word (VLIW) processors can reduce execution time even when the number of executed instructions increases moderately, as long as the dependence height is reduced.

Speculative and Predicated Execution

- Speculative execution: execution of an instruction before knowing that its execution is required.
- Predicated execution: architecture-supported conditional execution of an instruction based on the value of a Boolean source operand, referred to as the predicate of the instruction.
- Superblock: structure used to implement compiler-controlled speculative execution.
- If-conversion: compiler algorithm that converts conditional branches into predicate-defining instructions to allow the use of predication.

Trace Scheduling (Fisher, 1981)

Some optimization and scheduling decisions decrease the execution time of one control path while increasing the execution time of another. Decisions should therefore favor the more frequently executed paths to improve overall performance. Trace scheduling divides a procedure into a set of frequently executed traces (paths).

Trace Scheduling

There may be conditional branches out of the middle of a trace (side exits) and transitions from other traces into the middle of the trace (side entrances). These control-flow transitions are ignored during trace scheduling. After scheduling, bookkeeping is required to ensure the correct execution of off-trace code.

Bookkeeping for Trace Scheduling

[Figure: original trace Instr 1, Instr 2, Instr 3, Instr 4, Instr 5; scheduled trace Instr 2, Instr 3, Instr 4, Instr 1, Instr 5.]

What bookkeeping is required when Instr 1 is moved below the side entrance in the trace?

Bookkeeping for Trace Scheduling

[Figure: original trace Instr 1, Instr 2, Instr 3, Instr 4, Instr 5; scheduled trace Instr 2, Instr 3, Instr 4, Instr 1, Instr 5, with copies of Instr 3 and Instr 4 added as bookkeeping code.]

Bookkeeping for Trace Scheduling

[Figure: original trace Instr 1, Instr 2, Instr 3, Instr 4, Instr 5; scheduled trace Instr 1, Instr 5, Instr 2, Instr 3, Instr 4.]

What bookkeeping is required when Instr 5 moves above the side entrance in the trace?

Bookkeeping for Trace Scheduling

[Figure: original trace Instr 1, Instr 2, Instr 3, Instr 4, Instr 5; scheduled trace Instr 1, Instr 5, Instr 2, Instr 3, Instr 4, with a copy of Instr 5 added as bookkeeping code.]

Superblocks

A superblock is a trace without side entrances: control can only enter from the top, but it can leave at one or more exit points.

The formation of superblocks creates additional optimization opportunities because constraints associated with infrequently executed paths of control are ignored, so these constraints do not inhibit optimizations that favor frequently executed paths.

Superblock Formation (Example)

[Figure: control flow graph annotated with profile weights, including blocks B (90), C (10), E (90) and F (100) between Y and Z.]

Superblock Formation (Example)

[Figure: the same weighted control flow graph.]

Is this a superblock? No. A superblock cannot have side entrances, and this set of nodes has two side entrances into node F. How do we convert it into a superblock?

Superblock Formation (Example)

Tail duplication is the duplication of the basic blocks that appear after a side entrance. It eliminates side entrances and transforms a trace into a superblock.

[Figure: the same graph after tail duplication. F now has weight 90, with outgoing edge weights 89.1 and 0.9. Its copy F' has weight 10, with outgoing edge weights 9.9 and 0.1.]
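The weights in the figure can be reproduced with a small calculation. The sketch below is only an illustration: the function name and the edge labels "taken" and "exit" are hypothetical, and it assumes that the branch probabilities of F do not depend on the edge through which F was entered.

```python
def split_tail_weights(block_weight, out_edges, side_entry_weight):
    """Split a block's profile weights between the original and its duplicate.

    Assumes the outgoing branch probabilities are independent of the
    incoming edge (an assumption of this sketch, not stated on the slide).
    """
    fraction = side_entry_weight / block_weight
    original = {dst: w * (1 - fraction) for dst, w in out_edges.items()}
    duplicate = {dst: w * fraction for dst, w in out_edges.items()}
    return (block_weight - side_entry_weight, original), (side_entry_weight, duplicate)


# Block F from the example: weight 100, outgoing weights 99 and 1,
# with a total weight of 10 arriving through the side entrances.
(f_weight, f_edges), (fdup_weight, fdup_edges) = split_tail_weights(
    100, {"taken": 99, "exit": 1}, 10
)
print(f_weight, f_edges)        # weights that stay with F
print(fdup_weight, fdup_edges)  # weights that move to the copy F'
```

Up to floating-point rounding, F keeps 90 with edge weights 89.1 and 0.9, and F' receives 10 with edge weights 9.9 and 0.1.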

Common Subexpression Elimination in Superblocks

Original code (edge weights 99, 1, 1):
    opA: mul r1,r2,3
    opC: mul r3,r2,3
    opB: add r2,r2,1

Code after superblock formation (edge weights 99, 1):
    opA:  mul r1,r2,3
    opC:  mul r3,r2,3
    opB:  add r2,r2,1
    opC': mul r3,r2,3

Code after common subexpression elimination (edge weights 99, 1):
    opA:  mul r1,r2,3
    opC:  mov r3,r1
    opB:  add r2,r2,1
    opC': mul r3,r2,3

Operation Migration in Superblocks

Original code:
    … mov r0,r1 … mov r0,r2 … mov r0,r3 …
    add r1,r1,4   add r2,r2,4   add r3,r3,4
    (blocks X, Y, Z)

After operation migration:
    … … … …
    add r1,r1,4   add r2,r2,4   add r3,r3,4
    mov r0,r1     mov r0,r2     mov r0,r3
    (blocks X, Y, Z)

Global Variable Migration in Superblock Loops

Original program segment (backedge weight 100, exit weight 0):
    OpA: ld_i r4, x, r0
    OpB: add  r4, r4, r1
    OpC: st_i x, r0, r4
    OpD: add  r1, r1, 1
    OpE: add  r0, r0, 1

[Figure: table of the values of MEM[r0+x], r4, r1 and r0.]

Global Variable Migration in Superblock Loops

Original program segment (backedge weight 100, exit weight 0):
    OpA: ld_i r4, x, r0
    OpB: add  r4, r4, r1
    OpC: st_i x, r0, r4
    OpD: add  r1, r1, 1
    OpE: add  r0, r0, 1

After variable migration (backedge weight 100, exit weight 0), in the order shown on the slide:
    OpC:  st_i x, r0, r4
    OpC': st_i x, r0, r4
    OpE:  add  r0, r0, 1
    OpA:  ld_i r4, x, r0
    OpB:  add  r4, r4, r1
    OpD:  add  r1, r1, 1

Superblock Enlarging Optimizations

By enlarging a superblock, we give the scheduler more independent instructions to choose from in each cycle.

Superblock enlarging optimizations:
- Branch target expansion
- Loop unrolling
- Loop peeling

Branch Target Expansion

Idea: expand the superblock with the target of a likely taken branch.

Before (edge weights 20, 100):
        blt  r1, r2, L3
        beq  r3, r4, L5
    L1: jump L4
    L2:
    L3:

After (edge weight 20):
        blt  r1, r2, L3
        beq  r3, r4, L5
    L1: jump L4
    L2:

Superblock Loops

A superblock loop is a superblock that has a frequently taken backedge from its last node to its first node. We will study the extension of some common loop optimizations to superblocks.

Dependence Removing Optimizations

The goal is to eliminate data dependences between instructions within frequently executed superblocks.

Dependence removing optimizations include:
- Register renaming
- Accumulator variable expansion
- Induction variable expansion
- Search variable expansion
- Operation combining
- Strength reduction
- Tree height reduction

Instruction Latencies for Examples

Register Renaming Example

Original loop:
    for (j=0; j<n; j++) { C(j) = A(j)+B(j) }

Assembly code:
    L1: ld_f  f2, A, r1     (a)
        ld_f  f3, B, r1     (b)
        add_f f4, f2, f3    (c)
        st_f  C, r1, f4     (d)
        add   r1, r1, 4     (e)
        blt   r1, r5, L1    (f)

For all the examples we assume a superscalar processor with infinite resources and no register renaming hardware. For the code above, this yields the following schedule.

Register Renaming Example

[Figure: code schedule of instructions (a) to (f) for the original loop.]

Schedule length: 7 cycles / 1 iteration

Register Renaming Example

After loop unrolling (three copies of the original loop body):
    L1: ld_f  f2, A, r1     (a)
        ld_f  f3, B, r1     (b)
        add_f f4, f2, f3    (c)
        st_f  C, r1, f4     (d)
        add   r1, r1, 4     (e)
        ld_f  f2, A, r1     (f)
        ld_f  f3, B, r1     (g)
        add_f f4, f2, f3    (h)
        st_f  C, r1, f4     (i)
        add   r1, r1, 4     (j)
        ld_f  f2, A, r1     (k)
        ld_f  f3, B, r1     (l)
        add_f f4, f2, f3    (m)
        st_f  C, r1, f4     (n)
        add   r1, r1, 4     (o)
        blt   r1, r5, L1    (p)

Loop Unrolling

[Figure: code schedule of instructions (a) to (p) for the unrolled loop.]

Schedule length: 19 cycles / 3 iterations = 6.3 cycles / iteration

Register Renaming

After register renaming of the unrolled loop:
    L1: ld_f  f21, A, r11     (a)
        ld_f  f31, B, r11     (b)
        add_f f41, f21, f31   (c)
        st_f  C, r11, f41     (d)
        add   r12, r11, 4     (e)
        ld_f  f22, A, r12     (f)
        ld_f  f32, B, r12     (g)
        add_f f42, f22, f32   (h)
        st_f  C, r12, f42     (i)
        add   r13, r12, 4     (j)
        ld_f  f23, A, r13     (k)
        ld_f  f33, B, r13     (l)
        add_f f43, f23, f33   (m)
        st_f  C, r13, f43     (n)
        add   r11, r13, 4     (o)
        blt   r11, r5, L1     (p)

Loop Unrolling and Register Renaming

[Figure: code schedule of instructions (a) to (p) after loop unrolling and register renaming.]

Schedule length: 8 cycles / 3 iterations = 2.7 cycles / iteration

Accumulator Variable Expansion

An accumulator variable accumulates a sum or product in each iteration of a loop.

Accumulator variable expansion eliminates redefinitions of an accumulator variable within an unrolled loop by creating k temporary accumulators, where k is the number of accumulation instructions.

The values of all temporary accumulators must be summed at the exit points of the loop where the accumulator is live.
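A source-level analogue of this transformation may help. The sketch below is not the compiler algorithm itself: the function name and arguments are hypothetical, and it only shows how k independent accumulators replace one serial chain of additions and are combined at the loop exit.

```python
def accumulate_expanded(products, initial, k=3):
    """Sum `products` into `initial` using k independent accumulators."""
    # Only the first temporary starts from the old accumulator value;
    # the others start at zero.
    acc = [initial] + [0] * (k - 1)
    main = len(products) - len(products) % k
    for i in range(0, main, k):
        for u in range(k):          # body of the loop unrolled k times
            acc[u] += products[i + u]
    for i in range(main, len(products)):   # leftover iterations
        acc[0] += products[i]
    total = acc[0]
    for partial in acc[1:]:         # combine the temporaries at loop exit
        total += partial
    return total


print(accumulate_expanded([1, 2, 3, 4, 5, 6, 7], initial=10))  # prints 38
```

Each `acc[u]` forms its own dependence chain, so the k additions of one unrolled iteration no longer wait for each other. For floating-point data the reordering can change rounding, which is one reason a compiler may need permission to apply it.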

Accumulator Expansion Example

Original loop:
    for (k=0; k<n; k++) { C(i,j) = C(i,j) + A(i,k) * B(k,j) }

Assembly code:
        ld_f  f1, C, r2     (-)
    L1: ld_f  f3, A, r4     (a)
        ld_f  f5, B, r6     (b)
        mul_f f7, f3, f5    (c)
        add_f f1, f1, f7    (d)
        add   r4, r4, 4     (e)
        add   r6, r6, r8    (f)
        blt   r4, r9, L1    (g)
        st_f  C, r2, f1     (-)

For all examples we assume a superscalar processor with infinite resources and no register renaming hardware. For the code above, this yields the following schedule.

Accumulator Expansion Example

[Figure: code schedule of instructions (a) to (g) for the original loop.]

Schedule length: 8 cycles / 1 iteration

Loop Unrolling and Register Renaming

Assembly code after unrolling and renaming:
        ld_f  f1, C, r2       (-)
    L1: ld_f  f31, A, r41     (a)
        ld_f  f51, B, r61     (b)
        mul_f f71, f31, f51   (c)
        add_f f1, f1, f71     (d)
        add   r42, r41, 4     (e)
        add   r62, r61, r8    (f)
        ld_f  f32, A, r42     (g)
        ld_f  f52, B, r62     (h)
        mul_f f72, f32, f52   (i)
        add_f f1, f1, f72     (j)
        add   r43, r42, 4     (k)
        add   r63, r62, r8    (l)
        ld_f  f33, A, r43     (m)
        ld_f  f53, B, r63     (n)
        mul_f f73, f33, f53   (o)
        add_f f1, f1, f73     (p)
        add   r41, r43, 4     (q)
        add   r61, r63, r8    (r)
        blt   r41, r9, L1     (s)
        st_f  C, r2, f1       (-)

Loop Unrolling and Register Renaming

[Figure: code schedule of instructions (a) to (s) after unrolling and renaming.]

Schedule length: 14 cycles / 3 iterations = 4.7 cycles / iteration

Accumulator Expansion

Assembly code after accumulator expansion:
        ld_f  f11, C, r2      (-)
        mov_f f12, 0          (-)
        mov_f f13, 0          (-)
    L1: ld_f  f31, A, r41     (a)
        ld_f  f51, B, r61     (b)
        mul_f f71, f31, f51   (c)
        add_f f11, f11, f71   (d)
        add   r42, r41, 4     (e)
        add   r62, r61, r8    (f)
        ld_f  f32, A, r42     (g)
        ld_f  f52, B, r62     (h)
        mul_f f72, f32, f52   (i)
        add_f f12, f12, f72   (j)
        add   r43, r42, 4     (k)
        add   r63, r62, r8    (l)
        ld_f  f33, A, r43     (m)
        ld_f  f53, B, r63     (n)
        mul_f f73, f33, f53   (o)
        add_f f13, f13, f73   (p)
        add   r41, r43, 4     (q)
        add   r61, r63, r8    (r)
        blt   r41, r9, L1     (s)
        add_f f11, f11, f12   (-)
        add_f f11, f11, f13   (-)
        st_f  C, r2, f11      (-)

[Figure: code schedule of instructions (a) to (s) after accumulator expansion.]

Schedule length: 10 cycles / 3 iterations = 3.3 cycles / iteration

Induction Variable Expansion

An induction variable is used to index through loop iterations and through regular data structures, such as arrays.

Induction variable expansion eliminates dependences between definitions of induction variables and their uses in unrolled loops.

Induction Variable Expansion Example

Original loop:
    for (i=0; i<n; i++) { C(j) = A(j) * B(j); j = j + K; }

Assembly code:
    L1: ld_f  f3, A, r2     (a)
        ld_f  f4, B, r2     (b)
        mul_f f5, f3, f4    (c)
        st_f  C, r2, f5     (d)
        add   r2, r2, r7    (e)
        add   r1, r1, 1     (f)
        blt   r1, r6, L1    (g)

[Figure: code schedule of instructions (a) to (g) for the original loop.]

Schedule length: 6 cycles / 1 iteration

Loop Unrolling and Register Renaming

Assembly code after unrolling and renaming:
    L1: ld_f  f31, A, r21     (a)
        ld_f  f41, B, r21     (b)
        mul_f f51, f31, f41   (c)
        st_f  C, r21, f51     (d)
        add   r22, r21, r7    (e)
        ld_f  f32, A, r22     (f)
        ld_f  f42, B, r22     (g)
        mul_f f52, f32, f42   (h)
        st_f  C, r22, f52     (i)
        add   r23, r22, r7    (j)
        ld_f  f33, A, r23     (k)
        ld_f  f43, B, r23     (l)
        mul_f f53, f33, f43   (m)
        st_f  C, r23, f53     (n)
        add   r21, r23, r7    (o)
        add   r1, r1, 3       (p)
        blt   r1, r6, L1      (q)

Loop Unrolling and Register Renaming

[Figure: code schedule of instructions (a) to (q) after unrolling and renaming.]

Schedule length: 8 cycles / 3 iterations = 2.7 cycles / iteration

Induction Variable Expansion

Assembly code after induction variable expansion:
        mov   r21, r2         (-)
        add   r22, r21, r7    (-)
        add   r23, r22, r7    (-)
        mul   r71, r7, 3      (-)
    L1: ld_f  f31, A, r21     (a)
        ld_f  f41, B, r21     (b)
        mul_f f51, f31, f41   (c)
        st_f  C, r21, f51     (d)
        ld_f  f32, A, r22     (f)
        ld_f  f42, B, r22     (g)
        mul_f f52, f32, f42   (h)
        st_f  C, r22, f52     (i)
        ld_f  f33, A, r23     (k)
        ld_f  f43, B, r23     (l)
        mul_f f53, f33, f43   (m)
        st_f  C, r23, f53     (n)
        add   r21, r21, r71   (e)
        add   r22, r22, r71   (j)
        add   r23, r23, r71   (o)
        add   r1, r1, 3       (p)
        blt   r1, r6, L1      (q)

[Figure: code schedule of instructions (a) to (q) after induction variable expansion.]

Schedule length: 6 cycles / 3 iterations = 2 cycles / iteration

Search Variable Expansion

A search variable is a single value (e.g., a minimum or a maximum) computed for a collection of data.

Search variable expansion eliminates dependences between definitions of search variables and their uses in unrolled loops. Each search variable is expanded into k temporary independent variables.

At the exit of the loop, the value of the original search variable is obtained by comparing the values of the temporary search variables.
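As an illustration, here is a source-level sketch of the idea for a maximum search. The function name and the unroll factor k are hypothetical choices for this example, and the code is an analogue of the transformation rather than the compiler algorithm.

```python
def max_expanded(values, k=3):
    """Find the maximum of a non-empty list using k independent search variables."""
    if not values:
        raise ValueError("values must not be empty")
    candidates = [values[0]] * k    # k temporary search variables
    main = len(values) - len(values) % k
    for i in range(0, main, k):
        for u in range(k):          # body of the loop unrolled k times
            if values[i + u] > candidates[u]:
                candidates[u] = values[i + u]
    for i in range(main, len(values)):   # leftover iterations
        if values[i] > candidates[0]:
            candidates[0] = values[i]
    best = candidates[0]
    for c in candidates[1:]:        # compare the temporaries at loop exit
        if c > best:
            best = c
    return best


print(max_expanded([3, 9, 2, 7, 11, 4, 5]))  # prints 11
```

Each temporary is compared and updated independently within the unrolled body, so the comparisons of one iteration no longer form a serial chain.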

Superblock Scheduling

Superblock scheduling is a two-step process:
- Step 1: Build the dependence graph.
- Step 2: Perform list scheduling using the dependence graph, the instruction latencies, and the resource constraints of the processor.

List Scheduling

List scheduling employs heuristics to choose, among all ready nodes, the combination of nodes that should be scheduled in the current cycle.

A node is ready if:
(i) all its parents in the dependence graph have been scheduled;
(ii) the result produced by each parent is available; and
(iii) the resources required by the node are available.
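The three readiness conditions can be sketched as a cycle-by-cycle loop. This is a minimal illustration under stated assumptions, not a production scheduler: the only resource modeled is a fixed issue width, the priority heuristic is simply the order of the input list, the dependence graph must be acyclic, and the instruction names and latencies in the example are hypothetical.

```python
def list_schedule(ops, deps, latency, issue_width):
    """Return a dict mapping each op to its issue cycle.

    ops: op names in priority order (the heuristic is assumed to be given)
    deps: dict mapping an op to the list of its parents in the dependence graph
    latency: dict mapping an op to the number of cycles until its result is available
    """
    start = {}
    remaining = list(ops)
    cycle = 0
    while remaining:
        issued = 0
        for op in list(remaining):
            if issued == issue_width:       # (iii) no issue slot left this cycle
                break
            ready = all(
                p in start                           # (i) parent already scheduled
                and start[p] + latency[p] <= cycle   # (ii) parent result available
                for p in deps.get(op, [])
            )
            if ready:
                start[op] = cycle
                remaining.remove(op)
                issued += 1
        cycle += 1
    return start


# Hypothetical four-instruction example: two loads feed an add, which feeds a store.
DEPS = {"c": ["a", "b"], "d": ["c"]}
LATENCY = {"a": 2, "b": 2, "c": 3, "d": 1}
print(list_schedule(["a", "b", "c", "d"], DEPS, LATENCY, issue_width=1))
print(list_schedule(["a", "b", "c", "d"], DEPS, LATENCY, issue_width=2))
```

With an issue width of 1 the two loads serialize and the store issues in cycle 6. With an issue width of 2 both loads issue in cycle 0 and the store issues in cycle 5.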

Speculative Execution in Superblocks

To produce an efficient schedule, the compiler must be able to move instructions above and below branches.

    SB1:  R: x ← y + z
          …
          S: bnz r1, ...      (branch target: point P in block B2)

LIVE-OUT(BR) is the set of variables that may be used before being redefined when the branch BR is taken. In the example, LIVE-OUT(S) is the set of variables that are live at point P.

Speculative Execution in Superblocks

    SB1:  R: x ← y + z
          …
          S: bnz r1, ...      (branch target: point P in block B2)

If we want to move instruction R below the branch instruction S, two situations might occur:
1) x ∈ LIVE-OUT(S)
2) x ∉ LIVE-OUT(S)

What is the code that the compiler should produce for each situation?

Speculative Execution in Superblocks

    SB1:  R: x ← y + z
          …
          S: bnz r1, ...      (branch target: point P in block B2)

If we want to move instruction R below the branch instruction S, two situations might occur:
1) x ∈ LIVE-OUT(S): insert a copy of instruction R in the branch target.
2) x ∉ LIVE-OUT(S): no compensation code is required.

Speculative Execution in Superblocks

1) x ∈ LIVE-OUT(S): R' must be introduced in basic block B2.

    SB1:  …
          S: bnz r1, ...
          …
          R: x ← y + z
    B2:   R': x ← y + z
          ...
          P

2) x ∉ LIVE-OUT(S): no compensation code is required.

    SB1:  …
          S: bnz r1, ...
          …
          R: x ← y + z
    B2:   ...
          P

Speculative Execution in Superblocks

Upward code motion is more common, because it reduces the critical path of a superblock (e.g., moving a load instruction upward to hide the load latency).

There are two major restrictions on moving an instruction J from below to above a branch BR:
- Restriction 1: The destination of J is not in LIVE-OUT(BR).
- Restriction 2: J will never cause an exception that may terminate program execution when BR is taken.

Speculative Execution in Superblocks

Restriction 1 is usually removed by register renaming. By renaming the destination register of instruction J, we ensure that it is not in LIVE-OUT(BR).

There are two extreme interpretations of restriction 2.

Restricted Speculation Model: fully enforce restriction 2. Only instructions that cannot cause exceptions are candidates for speculative execution (e.g., memory loads, memory stores, integer divides, and all floating-point instructions cannot be speculated).

Speculative Execution in Superblocks

General Speculation Model: completely ignore restriction 2. This requires that the processor provide non-excepting, or silent, versions of all potentially excepting instructions in the instruction set architecture.

If an exception occurs for a silent instruction, it is simply ignored, and garbage is written to the destination.
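The two restrictions and the two speculation models can be combined into one legality check. The sketch below is an illustration only: the function name and the `model` argument are hypothetical, and the set of potentially excepting opcodes is an assumption that maps the categories named above (loads, stores, integer divide, floating-point operations) onto the opcode names used in these slides.

```python
# Assumed mapping of the potentially excepting categories onto the slide opcodes.
POTENTIALLY_EXCEPTING = {"ld_i", "ld_f", "st_i", "st_f", "div", "add_f", "mul_f"}


def may_move_above(opcode, dest, live_out_taken, model="restricted"):
    """Decide whether an instruction may be moved above a branch BR.

    live_out_taken: LIVE-OUT(BR), the variables live when BR is taken
    model: "restricted" enforces restriction 2, "general" ignores it
           (assuming silent versions of the instructions exist)
    """
    if dest in live_out_taken:          # restriction 1
        return False
    if model == "restricted" and opcode in POTENTIALLY_EXCEPTING:   # restriction 2
        return False
    return True


live_out = {"r2", "r3"}
print(may_move_above("ld_i", "r4", live_out, model="restricted"))  # False
print(may_move_above("ld_i", "r4", live_out, model="general"))     # True
print(may_move_above("add", "r3", live_out, model="general"))      # False
```

In the last call the destination r3 is live when the branch is taken, so the move is only legal after renaming the destination register, which is how restriction 1 is usually removed.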

Example for Speculative Execution

C code segment:
    avg = 0;
    weight = 0;
    count = 0;
    while (prt != NULL) {
        count = count + 1;
        if (prt->wt < 0)
            weight = weight - prt->wt;
        else
            weight = weight + prt->wt;
        prt = prt->next;
    }
    if (count != 0)
        avg = weight/count;

Assembly code segment:
    (i1)       ld_i r1, prt, 0
    (i2)       mov  r7, 0         // avg
    (i3)       mov  r2, 0         // count
    (i4)       mov  r3, 0         // weight
    (i5)       beq  r1, 0, L3
    (i6)  L0:  add  r2, r2, 1
    (i7)       ld_i r4, r1, 0     // prt->wt
    (i8)       bge  r4, 0, L1
    (i9)       sub  r3, r3, r4
    (i10)      jmp  L2
    (i11) L1:  add  r3, r3, r4
    (i12) L2:  ld_i r1, r1, 4
    (i13)      bne  r1, 0, L0
    (i14) L3:  beq  r2, 0, L4
    (i15)      div  r7, r3, r2
    (i16)      st_i avg, 0, r7
    (i17) L4:

Example for Speculative Execution

[Figure: trace selection for the loop. BB2 = i6, i7, i8; BB3 = i9, i10; BB4 = i11; BB5 = i12, i13. Edge weights: 10, 90, 99, 1, 1.]

Example for Speculative Execution

[Figure: the loop after superblock formation and branch target expansion. SB1 = i6, i7, i8, i11, i12, i13 (BB2, BB4, BB5); SB2 = i9, i12', i13' (BB3'). Edge weights: 10, 90, 99(1/10), 1(9/10), 1, 1(1/10), 99(1/10).]

Example for Speculative Execution

Assembly code segment after superblock formation and branch target expansion:
               ld_i r1, prt, 0
               mov  r7, 0         // avg
               mov  r2, 0         // count
               mov  r3, 0         // weight
               beq  r1, 0, L3
    (i6)   L0: add  r2, r2, 1
    (i7)       ld_i r4, r1, 0     // prt->wt
    (i8)       bge  r4, 0, LA
    (i11)      add  r3, r3, r4
    (i12)      ld_i r1, r1, 4     // prt->next
    (i13)      bne  r1, 0, L0
    (i9)   LA: sub  r3, r3, r4
    (i12')     ld_i r1, r1, 4     // prt->next
    (i13')     bne  r1, 0, L0
    (i14)  L3: beq  r2, 0, L4
    (i15)      div  r7, r3, r2
    (i16)      st_i avg, 0, r7
    (i17)  L4:

Example for Speculative Execution

               ld_i r1, prt, 0
               mov  r7, 0         // avg
               mov  r2, 0         // count
               mov  r3, 0         // weight
               beq  r1, 0, L3
    (I1)   L0: add  r2, r2, 1
    (I2)       ld_i r4, r1, 0     // prt->wt
    (I3)       blt  r4, 0, L1
    (I4)       add  r3, r3, r4
    (I5)       ld_i r5, r1, 4     // prt->next
    (I6)       beq  r5, 0, L3
    (I7)       add  r2, r2, 1
    (I8)       ld_i r6, r5, 0     // prt->wt
    (I9)       blt  r6, 0, L1'
    (I10)      add  r3, r3, r6
    (I11)      ld_i r1, r5, 4     // prt->next
    (I12)      bne  r1, 0, L0
           L3: beq  r2, 0, L4
               div  r7, r3, r2
               st_i avg, 0, r7
           L4:
          L1': mov  r1, r5
               mov  r4, r6
           L1: sub  r32, r3, r4
               ld_i r1, r1, 4
               bne  r1, 0, L0

Example for Speculative Execution

               ld_i r1, prt, 0
               mov  r7, 0         // avg
               mov  r2, 0         // count
               mov  r3, 0         // weight
               beq  r1, 0, L3
    (I1)   L0: add  r2, r2, 1
    (I2)       ld_i r4, r1, 0     // prt->wt
    (I3)       blt  r4, 0, L1
    (I4)       add  r3, r3, r4
    (I5)       ld_i r5, r1, 4     // prt->next
    (I6)       beq  r5, 0, L3
    (I7)       add  r2, r2, 1
    (I8)       ld_i r6, r5, 0     // prt->wt
    (I9)       blt  r6, 0, L1'
    (I10)      add  r3, r3, r6
    (I11)      ld_i r1, r5, 4     // prt->next
    (I12)      bne  r1, 0, L0
               div  r7, r3, r2
               st_i avg, 0, r7
           L4:
          L1': mov  r1, r5
               mov  r4, r6
           L1: sub  r32, r3, r4
               ld_i r1, r1, 4
               bne  r1, 0, L0
           L3: beq  r2, 0, L4

Hyperblocks

Suggested reading: Scott A. Mahlke's Ph.D. thesis, chap. 7.

Hyperblock

A hyperblock is a collection of connected basic blocks in which control may only enter through the first block (the entry block). Control flow may leave from any number of blocks in the hyperblock.

Before scheduling, all control flow between basic blocks within a hyperblock is removed via if-conversion.

Hyperblock Formation

A five-step procedure is used to form hyperblocks:
1. region identification
2. loop backedge coalescing
3. block selection
4. tail duplication
5. if-conversion

Running Example: wc

Mahlke uses the inner loop of wc, the Linux program that counts the number of characters, words, and lines in a file, as a running example.

The source code

The labels A to M mark the basic blocks.

       linect = wordct = charct = token = 0;
       for ( ; ; ) {
    A:     if (--(fp)->cnt < 0)
    C:         c = filbuf(fp);
           else
    B:         c = *(fp)->ptr++;
    D:     if (c == EOF) break;
    E:     charct++;
           if ((' ' < c) &&
    F:         (c < 0177)) {
    H:         if (!token) {
    K:             wordct++;
                   token++;
               }
               continue;
           }
    G:     if (c == '\n')
    I:         linect++;
    J:     else if ((c != ' ') &&
    L:              (c != '\t'))
               continue;
    M:     token = 0;
       }

The Assembly Code

    LA: ld_i r98, r3, 0
        add  r27, r98, -1
        st_i r3, 0, r27
        blt  r98, 1, LC
    LB: ld_i r30, r3, 4
        add  r29, r30, 1
        st_i r3, 4, r29
        ld_c r4, r30, 0
    LD: beq  r4, -1, EXIT
    LE: ld_i r33, r73, 0
        add  r32, r33, 1
        st_i r73, 0, r32
        bge  32, r4, LG
    LF: bge  r4, 127, LG
    LH: bne  0, r2, LA
    LK: ld_i r36, r72, 0
        add  r35, r36, 1
        st_i r72, 0, r35
        add  r2, r2, 1
        jmp  LA
    LG: beq  r4, r10, LI
    LJ: bne  r4, 32, LL
    LM: mov  r2, 0
        jmp  LA
    LI: ld_i r39, r71, 0
        add  r38, r39, 1
        st_i r71, 0, r38
        jmp  LM
    LL: bne  r4, 9, LA
        jmp  LM
    LC: mov  Parm0, r3
        jsr  filbuf
        mov  r4, Ret0
        jmp  LD

Control Flow Graph

[Figure: control flow graph of the wc inner loop, with basic blocks A through M and EXIT and profile weights on the edges.]

Statistics of the Example

wc is formed by small basic blocks with a large percentage of branches.

It contains 13 basic blocks and 34 instructions, 14 of which are branches:
- 8 conditional
- 5 unconditional
- 1 subroutine call

Step 1: Region Identification

A region is a group of basic blocks with a single entry block that dominates all the blocks in the region. Regions are used because they provide easy-to-compute outer boundaries for hyperblocks.

A basic block can only reside in a single region. A second constraint imposed on region formation is that regions may not contain internal cycles (this constraint is relaxed later).

In wc, the entire control flow graph forms a region.

Step 2: Backedge Coalescing

If-conversion can only remove non-loop branches. Thus we need to coalesce all backedges into a single backedge. This allows the control logic that chooses which backedge is taken to be eliminated by if-conversion.

To coalesce the backedges, we introduce a new node that will be the origin of the new single backedge. Then we retarget all existing backedges to this new node.
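The retargeting step can be sketched on an edge list. This is a minimal illustration with hypothetical names and weights. It assumes the edge list covers a single loop region, so that every listed edge into the header is a backedge (entry edges from outside the region are not in the list).

```python
def coalesce_backedges(edges, header, new_node="N"):
    """Retarget every backedge to a new node with a single backedge to the header.

    edges: list of (source, destination, weight) tuples for one loop region.
    Assumes every listed edge whose destination is `header` is a backedge.
    """
    result = []
    total = 0
    for src, dst, weight in edges:
        if dst == header:
            result.append((src, new_node, weight))   # retargeted backedge
            total += weight
        else:
            result.append((src, dst, weight))
    result.append((new_node, header, total))         # the new single backedge
    return result


# Hypothetical region with two backedges into the header A.
region = [("A", "B", 10), ("B", "C", 4), ("B", "A", 6), ("C", "A", 3)]
for edge in coalesce_backedges(region, header="A"):
    print(edge)
```

The weight of the new backedge is the sum of the weights of the backedges it replaces, so the profile information stays consistent.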

CFG Before Backedge Coalescing

[Figure: control flow graph of the wc inner loop, in which several backedges return to block A.]

CFG After Backedge Coalescing

[Figure: the same graph with a new block N. All former backedges now target N, and a single backedge with weight 105K goes from N to A.]

Step 3: Block Selection

Two conflicting goals:
(1) Including more blocks can improve performance by eliminating the branches among the included blocks.
(2) Including too many blocks may result in performance loss due to over-saturation of processor resources or increased dependence height.

Enumerating Execution Paths

An execution path is a path of control flow from the entry block to an exit block in the region.

Mahlke assigns a priority to each execution path. This priority indicates the path's relative importance. Paths are included in the hyperblock from the highest to the lowest priority, based on the available resources.

Mahlke also estimates the available resources and the resource use of each path.

Path Priority Function

The path priority function combines four elements:
(1) path execution frequency;
(2) number of instructions in the path;
(3) path dependence height;
(4) hazard conditions on the path.

Intuition: include paths that have fewer instructions, lower dependence height, and few hazard conditions, and that are executed very often.

Hazard conditions include procedure calls and unresolvable memory stores.

Path Priority Function

Mahlke uses a hazard multiplier of 0.25 for all paths containing a subroutine call or an unresolvable memory reference, and 1.0 for all other paths.

Path Priority Function

The constant K ensures that the path with the largest dependence height and the most operations still has a non-zero priority. Mahlke used K = 0.1.

Block Selection Algorithm

    ISSUE_WIDTH = 1 to 8    /* as specified in the machine description file */
    RES_MULTIPLIER = 2
    MAX_DEP_GROWTH = 3
    MIN_PATH_PRIORITY_RATIO = 0.10

    block_selection(region) {
        enumerate all paths in the region
        calculate priority of each path
        sort paths from highest to lowest priority

        /* Initialization of loop variables */
        avail_resources = ISSUE_WIDTH × dep_height_1 × RES_MULTIPLIER
        used_resources = 0
        last_priority = 0.0
        selected_paths = ∅

        for (i = 1 to num_paths) {
            /* Check if there are enough resources available to include the path */
            if ((num_ops_i + used_resources) > avail_resources) {
                continue
            }
            /* Prevent paths with large relative dependence heights from being included */
            if (dep_height_i > (dep_height_1 × MAX_DEP_GROWTH)) {
                continue
            }
            /* Do not include paths with a small relative priority to that of the last included path */
            if (priority_i < (last_priority × MIN_PATH_PRIORITY_RATIO)) {
                continue
            }
            /* Include the path in the hyperblock */
            selected_paths = selected_paths ∪ path_i
            used_resources = used_resources + num_ops_i
            last_priority = priority_i
        }
        selected_blocks = all blocks contained within selected_paths
        return selected_blocks
    }
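The pseudocode translates almost line by line into Python. In the sketch below the `Path` record and the example paths are hypothetical, and the priorities are taken as given inputs instead of being computed from a priority function. Index 1 in the pseudocode corresponds to the highest-priority path after sorting.

```python
from dataclasses import dataclass


@dataclass
class Path:
    blocks: tuple       # basic blocks on the path
    priority: float     # precomputed path priority
    num_ops: int        # number of operations on the path
    dep_height: int     # dependence height of the path


def block_selection(paths, issue_width, res_multiplier=2,
                    max_dep_growth=3, min_path_priority_ratio=0.10):
    ordered = sorted(paths, key=lambda p: p.priority, reverse=True)
    if not ordered:
        return set()
    first = ordered[0]
    avail_resources = issue_width * first.dep_height * res_multiplier
    used_resources = 0
    last_priority = 0.0
    selected_paths = []
    for path in ordered:
        # Not enough resources left for this path.
        if path.num_ops + used_resources > avail_resources:
            continue
        # Dependence height too large relative to the best path.
        if path.dep_height > first.dep_height * max_dep_growth:
            continue
        # Priority too small relative to the last included path.
        if path.priority < last_priority * min_path_priority_ratio:
            continue
        selected_paths.append(path)
        used_resources += path.num_ops
        last_priority = path.priority
    return {block for path in selected_paths for block in path.blocks}


# Hypothetical paths, for illustration only.
example_paths = [
    Path(("A", "B", "D"), priority=1.0, num_ops=6, dep_height=3),
    Path(("A", "C", "D"), priority=0.05, num_ops=4, dep_height=3),
    Path(("A", "B", "E"), priority=0.5, num_ops=5, dep_height=20),
    Path(("A", "B", "F"), priority=0.4, num_ops=5, dep_height=4),
]
print(sorted(block_selection(example_paths, issue_width=2)))  # ['A', 'B', 'D', 'F']
```

In this example the path through E is rejected for its dependence height, and the path through C is rejected because the remaining resources (12 available, 11 used) cannot hold its 4 operations.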

Block Selection

[Figure: control flow graph after backedge coalescing.]

Execution paths in the region:
1. A-B-D-E-F-H-N
2. A-B-D-E-F-H-K-N
3. A-B-D-E-G-J-M-N
4. A-B-D-E-G-J-L-M-N
5. A-B-D-E-G-I-M-N
6. A-B-D-E-G-J-L-N
7. A-B-D
8. A-C-D-E-F-H-N
9. A-C-D-E-F-H-K-N
10. A-C-D-E-G-J-M-N
11. A-C-D-E-G-J-L-M-N
12. A-C-D-E-G-I-M-N
13. A-C-D-E-G-J-L-N
14. A-C-D
15. A-B-D-E-F-G-I-M-N
16. A-B-D-E-F-G-J-M-N
17. A-B-D-E-F-G-J-L-M-N
18. A-B-D-E-F-G-J-L-N
19. A-C-D-E-F-G-I-M-N
20. A-C-D-E-F-G-J-M-N
21. A-C-D-E-F-G-J-L-M-N
22. A-C-D-E-F-G-J-L-N
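The 22 paths can be reproduced by a depth-first walk over the acyclic region. In the sketch below the successor table is my reading of the branch targets in the wc assembly code, with all backedges retargeted to N. The function name is hypothetical. Paths that leave the loop are shown ending in EXIT, where the list above stops at D.

```python
# Successors of each block in the wc region after backedge coalescing
# (derived from the branch targets of the assembly code; an interpretation).
SUCCESSORS = {
    "A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["EXIT", "E"],
    "E": ["F", "G"], "F": ["H", "G"], "H": ["N", "K"], "K": ["N"],
    "G": ["I", "J"], "I": ["M"], "J": ["M", "L"], "L": ["M", "N"],
    "M": ["N"], "N": [], "EXIT": [],
}


def enumerate_paths(successors, entry):
    """Return every path from `entry` to a block without successors."""
    paths = []

    def walk(node, prefix):
        prefix = prefix + [node]
        if not successors[node]:        # reached N or EXIT
            paths.append(prefix)
            return
        for nxt in successors[node]:
            walk(nxt, prefix)

    walk(entry, [])
    return paths


all_paths = enumerate_paths(SUCCESSORS, "A")
print(len(all_paths))                   # prints 22
print("-".join(all_paths[0]))
```

The walk terminates because the region has no internal cycles once the single backedge from N to A is left out of the table.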

Path Selection

Some paths that are not selected by the block selection algorithm are nevertheless included in the hyperblock, because all of their blocks belong to selected paths. An alternative procedure could eliminate these paths from the path set before selection, but the cost of that elimination would be higher than the cost of keeping the extra paths in the set.

Step 4: Tail Duplication

To convert the set of selected blocks into a hyperblock (with a single entry block), control flow from non-selected blocks (side entry points) must be eliminated. The tail duplication algorithm first marks all blocks that have side entry points. Then the algorithm marks all blocks that can be reached from marked blocks. All marked blocks form the tails that must be duplicated.
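The marking step described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the CFG is a plain successor dictionary, duplicated blocks are named by appending a prime, marking is restricted to selected blocks, and an edge back to the entry block is not treated as a side entry. None of these representation choices come from the slides.

```python
def tail_blocks(succs, selected, entry):
    """Mark the selected blocks that must be duplicated."""
    preds = {}
    for src, dsts in succs.items():
        for dst in dsts:
            preds.setdefault(dst, set()).add(src)
    # Step 1: blocks with a side entry point (a predecessor outside the selection).
    marked = {b for b in selected
              if b != entry and any(p not in selected for p in preds.get(b, ()))}
    # Step 2: selected blocks reachable from marked blocks.
    work = list(marked)
    while work:
        b = work.pop()
        for s in succs.get(b, ()):
            if s in selected and s != entry and s not in marked:
                marked.add(s)
                work.append(s)
    return marked


def duplicate_tails(succs, selected, entry):
    """Return (new CFG, tail set); side entries are redirected to the copies."""
    tail = tail_blocks(succs, selected, entry)
    new = {b: list(dsts) for b, dsts in succs.items()}
    for b in tail:
        # A copy branches to copies of tail blocks and to originals otherwise.
        new[b + "'"] = [s + "'" if s in tail else s for s in succs.get(b, ())]
    for src, dsts in succs.items():
        if src not in selected:
            new[src] = [d + "'" if d in tail else d for d in dsts]
    return new, tail
```

After the transformation, no non-selected block branches into the middle of the selected region, so the region has a single entry.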

Tail Duplication

[Figure: the weighted control flow graph of the example region, shown before and after tail duplication. The final frame adds the duplicated blocks D’, E’, F’, G’, H’, I’, J’, K’, L’, M’, and N’.]

Anatomy of a Predicate Computation Operation

p<cmp> Pout1(<type>), Pout2(<type>), src1, src2 (Pin)

This instruction assigns values to Pout1 and Pout2. The value assigned depends on:
- the result of the comparison
- the value of Pin
- the type of Pout1 and Pout2

Anatomy of a Predicate Computation Operation

p<cmp> Pout1(<type>), Pout2(<type>), src1, src2 (Pin)

<cmp> = eq | ne | gt | ...
<type> = U | /U | OR | /OR | AND | /AND

Example: pge p4(OR), p2(/U), r4, 127 (p1)
<cmp> = ge, Pin = p1, Pout1 = p4, Pout2 = p2, src1 = r4, src2 = 127

Anatomy of a Predicate Computation Operation

p<cmp> Pout1(<type>), Pout2(<type>), src1, src2 (Pin)
<type> = U | /U | OR | /OR | AND | /AND

U or /U: always write into the destination register.

if type = U then
    if Pin = 0 then Pout = 0
    elseif src1 <cmp> src2 then Pout = 1
    else Pout = 0

if type = /U then
    if Pin = 0 then Pout = 0
    elseif src1 <cmp> src2 then Pout = 0
    else Pout = 1

Anatomy of a Predicate Computation Operation

p<cmp> Pout1(<type>), Pout2(<type>), src1, src2 (Pin)
<type> = U | /U | OR | /OR | AND | /AND

OR or /OR: write into the destination register only if Pin = 1 and <cmp> is true (OR) or false (/OR).

if type = OR and Pin = 1 and src1 <cmp> src2 then Pout = 1
if type = /OR and Pin = 1 and src1 !<cmp> src2 then Pout = 1

Used when the execution of a block is enabled by one of multiple conditions. OR-type predicates must be initialized to 0 before their use.

Anatomy of a Predicate Computation Operation

p<cmp> Pout1(<type>), Pout2(<type>), src1, src2 (Pin)
<type> = U | /U | OR | /OR | AND | /AND

AND or /AND: write into the destination register only if Pin = 1 and <cmp> is false (AND) or true (/AND).

if type = AND and Pin = 1 and src1 !<cmp> src2 then Pout = 0
if type = /AND and Pin = 1 and src1 <cmp> src2 then Pout = 0

Used when the execution of a block requires several conditions to be true. AND-type predicates are often initialized to 1.
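The six destination types can be summarized in one small function. This is a sketch for experimenting with the rules on the preceding slides, not a model of any particular machine: the function name, the argument order, and the `CMP` table (which lists only eq, ne, gt, and the ge used in the example) are assumptions made for illustration.

```python
import operator

# Comparison mnemonics; only those named on the slides are listed here.
CMP = {"eq": operator.eq, "ne": operator.ne,
       "gt": operator.gt, "ge": operator.ge}


def pred_define(cmp, ptype, src1, src2, p_in, p_old):
    """Return the new value of one destination predicate.

    p_old is the current value of the destination; it is returned
    unchanged whenever the rules say the destination is not written.
    """
    c = CMP[cmp](src1, src2)
    if ptype in ("U", "/U"):
        # Unconditional types always write, even when Pin = 0.
        if not p_in:
            return 0
        return int(c if ptype == "U" else not c)
    if not p_in:
        return p_old                      # OR/AND types never write when Pin = 0
    if ptype == "OR":
        return 1 if c else p_old
    if ptype == "/OR":
        return 1 if not c else p_old
    if ptype == "AND":
        return 0 if not c else p_old
    if ptype == "/AND":
        return 0 if c else p_old
    raise ValueError(f"unknown predicate type: {ptype}")
```

For the example `pge p4(OR), p2(/U), r4, 127 (p1)`, calling the function once with type `"OR"` and once with type `"/U"` gives the new values of p4 and p2.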

Predicate Comparison Truth Table

p<cmp> Pout1(<type>), Pout2(<type>), src1, src2 (Pin)

Pin predicates the entire predicate computation instruction. Notice that for an unconditional type, the value 0 is written in Pout even when Pin is 0.

The table restates the rules defined for each type; "-" means the destination is left unchanged.

Pin  Comparison | U   /U   OR   /OR   AND   /AND
0    0          | 0   0    -    -     -     -
0    1          | 0   0    -    -     -     -
1    0          | 0   1    -    1     0     -
1    1          | 1   0    1    -     -     0

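Predicate Comparison Truth Table

Example: pge p4(OR), p2(/U), r4, 127 (p1)

Here p2 is always written: it becomes 0 when p1 = 0, and otherwise the complement of the comparison r4 ≥ 127. p4 is set to 1 only when p1 = 1 and r4 ≥ 127; in every other case it keeps its previous value.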

Predicate Types

Unconditional predicates are used for control dependence sets that have a single edge. OR-type predicates are used for predicates with multiple edges in their control dependence sets. (OR-type predicates must be cleared before entering the hyperblock.)

Step 5: If-conversion

For graph drawing, Mahlke uses the convention that the left edge out of a basic block is the true condition and the right edge is the false condition.

[Figure: block G with left successor I and right successor J.]

In this control flow graph, the control dependences of blocks I and J are:
I: brG
J: /brG

Step 5: If-conversion (example)

[Figure: weighted control flow graph of the example region after tail duplication, with the duplicated tail collapsed into a single node D’-N’.]

Step 5: If-conversion (example)

Code of the example region before if-conversion:

LA: ld_i  r98, r3, 0
    add   r27, r98, -1
    st_i  r3, 0, r27
    blt   r98, 1, LC
LB: ld_i  r30, r3, 4
    add   r29, r30, 1
    st_i  r3, 4, r29
    ld_c  r4, r30, 0
LD: beq   r4, -1, EXIT
LE: ld_i  r33, r73, 0
    add   r32, r33, 1
    st_i  r73, 0, r32
    bge   32, r4, LG
LF: bge   r4, 127, LG
LH: bne   0, r2, LA
LK: ld_i  r36, r72, 0
    add   r35, r36, 1
    st_i  r72, 0, r35
    add   r2, r2, 1
    jmp   LA
LG: beq   r4, r10, LI
LJ: bne   r4, 32, LL
LM: mov   r2, 0
    jmp   LA
LI: ld_i  r39, r71, 0
    add   r38, r39, 1
    st_i  r71, 0, r38
    jmp   LM
LL: bne   r4, 9, LA
    jmp   LM
LC: mov   Parm0, r3
    jsr   filbuf
    mov   r4, Ret0
    jmp   LD

If-conversion clears the OR-type predicates (pclr p4, p6) and replaces the branches inside the region with predicate define instructions. The exit branches blt r98, 1, LC and beq r4, -1, EXIT remain branches. The converted code begins with the following defines:

    pge   p4(OR), p1(/U), 32, r4
    pge   p4(OR), p2(/U), r4, 127 (p1)
    peq   p3(U), -, 0, r2 (p2)
    peq   p6(OR), p5(/U), r4, r10 (p4)
    peq   p7(U), -, r4, r10 (p4)
    ...

Inner Loop After If-conversion

    pclr  p4, p6
    ld_i  r98, r3, 0
    add   r27, r98, -1
    st_i  r3, 0, r27
    blt   r98, 1, LC
    ld_i  r30, r3, 4
    add   r29, r30, 1
    st_i  r3, 4, r29
    ld_c  r4, r30, 0
    beq   r4, -1, EXIT
    ld_i  r33, r73, 0
    add   r32, r33, 1
    st_i  r73, 0, r32
    pge   p4(OR), p1(/U), 32, r4
    pge   p4(OR), p2(/U), r4, 127 (p1)
    peq   p3(U), -, 0, r2 (p2)
    peq   p6(OR), p5(/U), r4, r10 (p4)
    peq   p7(U), -, r4, r10 (p4)
    peq   p6(OR), p8(/U), r4, 32 (p5)
    ld_i  r36, r72, 0 (p3)
    add   r35, r36, 1 (p3)
    st_i  r72, 0, r35 (p3)
    add   r2, r2, 1 (p3)
    ld_i  r39, r71, 0 (p7)
    add   r38, r39, 1 (p7)
    st_i  r71, 0, r38 (p7)
    peq   p6(OR), -, r4, 9 (p8)
    mov   r2, 0 (p6)
    jmp   loop

The only branches left inside the loop body are the two exit branches, blt r98, 1, LC and beq r4, -1, EXIT.

Predicate Hierarchy Graph

The Predicate Hierarchy Graph (PHG) is a directed acyclic graph representing the Boolean equations used to compute all the predicates in a hyperblock. There are two types of nodes in the PHG: predicate nodes and condition nodes. Two PHG nodes x and y are connected if the value specified by x is used to directly compute the value of y. The PHG is used to derive relationships among predicates.

Example of PHG Construction

Each predicate define in the hyperblock is annotated in brackets with the conditions it evaluates:

    pclr  p4, p6
    ld_i  r98, r3, 0
    add   r27, r98, -1
    st_i  r3, 0, r27
    blt   r98, 1, LC
    ld_i  r30, r3, 4
    add   r29, r30, 1
    st_i  r3, 4, r29
    ld_c  r4, r30, 0
    beq   r4, -1, EXIT
    ld_i  r33, r73, 0
    add   r32, r33, 1
    st_i  r73, 0, r32
    pge   p4(OR), p1(/U), 32, r4          [c1, /c1]
    pge   p4(OR), p2(/U), r4, 127 (p1)    [c2, /c2]
    peq   p3(U), -, 0, r2 (p2)            [c3]
    peq   p6(OR), p5(/U), r4, r10 (p4)    [c4, /c4]
    peq   p7(U), -, r4, r10 (p4)          [c4]
    peq   p6(OR), p8(/U), r4, 32 (p5)     [c5, /c5]
    ld_i  r36, r72, 0 (p3)
    add   r35, r36, 1 (p3)
    st_i  r72, 0, r35 (p3)
    add   r2, r2, 1 (p3)
    ld_i  r39, r71, 0 (p7)
    add   r38, r39, 1 (p7)
    st_i  r71, 0, r38 (p7)
    peq   p6(OR), -, r4, 9 (p8)           [c6]
    mov   r2, 0 (p6)
    jmp   loop

The PHG starts from the root node T (true) and grows by one predicate define at a time:

1. pge p4(OR), p1(/U), 32, r4 [c1, /c1]: conditions c1 and /c1 hang from T; c1 feeds p4 and /c1 feeds p1.
2. pge p4(OR), p2(/U), r4, 127 (p1) [c2, /c2]: conditions c2 and /c2 hang from p1; c2 feeds p4 and /c2 feeds p2.
3. peq p3(U), -, 0, r2 (p2) [c3]: condition c3 hangs from p2 and feeds p3.
4. peq p6(OR), p5(/U), r4, r10 (p4) [c4, /c4]: conditions c4 and /c4 hang from p4; c4 feeds p6 and /c4 feeds p5.
5. peq p7(U), -, r4, r10 (p4) [c4]: condition c4 also feeds p7.
6. peq p6(OR), p8(/U), r4, 32 (p5) [c5, /c5]: conditions c5 and /c5 hang from p5; c5 feeds p6 and /c5 feeds p8.
7. peq p6(OR), -, r4, 9 (p8) [c6]: condition c6 hangs from p8 and feeds p6.

[Figure: the completed PHG, with predicate nodes T and p1 through p8 and condition nodes c1 through c6.]

Purpose of PHG

The PHG allows the compiler to derive relations among the predicates. Mahlke identifies three predicate relations:

Ancestor: p_i is an ancestor of p_j if all conditions used to compute p_j are derived from p_i. The compiler can be sure that p_j can be true only when p_i is also true.

Control Path: There is a control path between p_i and p_j if there is at least one set of conditions under which both p_i and p_j are true. The compiler knows that p_i and p_j may be true at the same time.

Implies: p_i implies p_j if the conditions that make p_i true guarantee that p_j will also be true.

Predicate Relations in the Example

[Figure: the annotated hyperblock code and its completed PHG, repeated for each relation.]

Imply relationship: p7 implies p6.

Ancestor relationship: Which predicate nodes are ancestors of p5? T, p4, and p5.

Control path relationship: Which predicate nodes are in the same control path as p5? T, p1, p4, p5, p6, and p8.
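The three answers above can be cross-checked by brute force. The sketch below does not traverse a PHG. It substitutes a simpler technique: each predicate is written as a Boolean equation over the conditions c1 to c6, read directly from the predicate defines of the example, and all 64 assignments are enumerated. The conditions are treated as independent, and the ancestor test uses only the consequence stated on the slide (p_j can be true only when p_i is true). All function names are illustrative.

```python
from itertools import product


def predicates(c):
    """Evaluate T and p1..p8 for one assignment c[1]..c[6] of the conditions."""
    p = {"T": True}
    p["p1"] = not c[1]                      # pge ..., p1(/U)        [/c1]
    p["p4"] = c[1] or (p["p1"] and c[2])    # p4(OR) under T and p1  [c1, c2]
    p["p2"] = p["p1"] and not c[2]          # p2(/U) under p1        [/c2]
    p["p3"] = p["p2"] and c[3]              # p3(U) under p2         [c3]
    p["p5"] = p["p4"] and not c[4]          # p5(/U) under p4        [/c4]
    p["p7"] = p["p4"] and c[4]              # p7(U) under p4         [c4]
    p["p8"] = p["p5"] and not c[5]          # p8(/U) under p5        [/c5]
    # p6(OR) is written under p4 [c4], p5 [c5], and p8 [c6].
    p["p6"] = (p["p4"] and c[4]) or (p["p5"] and c[5]) or (p["p8"] and c[6])
    return p


ALL = [predicates(dict(zip(range(1, 7), bits)))
       for bits in product([False, True], repeat=6)]
NAMES = sorted(ALL[0])  # 'T', 'p1', ..., 'p8'


def implies(a, b):
    """True if b holds in every assignment in which a holds."""
    return all(v[b] for v in ALL if v[a])


def ancestors(x):
    """Predicates that must be true whenever x is true."""
    return [a for a in NAMES if implies(x, a)]


def control_path(x):
    """Predicates that can be true at the same time as x."""
    return [b for b in NAMES if any(v[x] and v[b] for v in ALL)]
```

With these definitions, `implies("p7", "p6")`, `ancestors("p5")`, and `control_path("p5")` reproduce the three answers given on the slides.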

Classical/ILP Optimizations in Predicated Code

Example: Copy Propagation

A: mov   r1, r2      (p1)
B: add   r2, r3, r4  (p2)
C: ld_i  r5, r1, 0   (p3)

Is the copy propagation from instruction A to instruction C legal? After propagation the code would read:

A: mov   r1, r2      (p1)
B: add   r2, r3, r4  (p2)
C: ld_i  r5, r2, 0   (p3)

It depends on what we know about the relationship among p1, p2, and p3. If it is possible that p1 is false and p3 is true, the propagation would be wrong!

Classical/ILP Optimizations in Predicated Code

Example: Copy Propagation

A: mov   r1, r2      (p1)
B: add   r2, r3, r4  (p2)
C: ld_i  r5, r1, 0   (p3)

For instance, if we know that:
(1) p1 is an ancestor of both p2 and p3, and
(2) p2 and p3 are mutually exclusive,
then we can do the copy propagation safely.

[Figure: PHG fragment with predicate nodes p1, pk, p2, and p3 and condition nodes cm and /cm.]
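A small enumeration makes the argument concrete. The sketch below models only what matters for this example: register contents are symbolic strings, the load in C is reduced to "which value does its base register hold", and the two predicate relations are encoded as a filter over the eight possible predicate assignments. The helper names are hypothetical.

```python
from itertools import product


def value_used_by_c(p1, p2, p3, propagated):
    """Symbolic value of the base register read by C, or None if C is nullified."""
    regs = {"r1": "old_r1", "r2": "old_r2"}
    if p1:
        regs["r1"] = regs["r2"]          # A: mov r1, r2 (p1)
    if p2:
        regs["r2"] = "r3+r4"             # B: add r2, r3, r4 (p2)
    if not p3:
        return None                      # C does not execute
    return regs["r2" if propagated else "r1"]  # C: ld_i r5, <base>, 0 (p3)


def counterexamples(allowed):
    """Predicate assignments permitted by `allowed` where propagation changes C."""
    return [ps for ps in product([False, True], repeat=3)
            if allowed(*ps)
            and value_used_by_c(*ps, False) != value_used_by_c(*ps, True)]


def no_knowledge(p1, p2, p3):
    return True


def slide_conditions(p1, p2, p3):
    # (1) p1 is an ancestor of p2 and p3; (2) p2 and p3 are mutually exclusive.
    return (p1 or not p2) and (p1 or not p3) and not (p2 and p3)
```

Without any knowledge about the predicates, `counterexamples(no_knowledge)` is non-empty; under the two conditions from the slide, `counterexamples(slide_conditions)` is empty.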

Classical/ILP Optimizations in Predicated Code

Example: Instruction Scheduling

A: ld_i  r1, r2, r3  (p2)
B: add   r4, r1, 4   (p2)
C: ld_i  r1, r5, 0   (p3)
D: mul   r6, r1, r7  (p3)

What are the data dependences in the code above? It depends on what we know about the relationship between p2 and p3.

For instance, if we know that p2 and p3 are mutually exclusive, we have one DDG.

[Figure: PHG fragment in which pk and the conditions cm and /cm define p2 and p3, with the resulting DDG over A, B, C, and D.]

But if p2 implies p3, then we have a different DDG.

[Figure: PHG fragment in which pk and the condition cm define p2 and p3, with the resulting DDG over A, B, C, and D.]
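One way to see the effect of predicate relations on scheduling is a pairwise register dependence test that skips pairs whose predicates are known to be mutually exclusive. The sketch below is deliberately conservative and is not the analysis from the slides: it performs no kill analysis, so it may report more edges than the DDGs in the figures, and the instruction tuples are an assumed encoding of the four instructions above.

```python
# (name, destination, sources, predicate) for instructions A-D above.
INSTRS = [
    ("A", "r1", ("r2", "r3"), "p2"),
    ("B", "r4", ("r1",), "p2"),
    ("C", "r1", ("r5",), "p3"),
    ("D", "r6", ("r1", "r7"), "p3"),
]


def dependences(instrs, exclusive=frozenset()):
    """Return (earlier, later, kind) edges.

    `exclusive` holds frozensets of predicate pairs known to be
    mutually exclusive; such pairs can never both execute, so no
    dependence is recorded between them.
    """
    edges = []
    for i, (n1, d1, s1, p1) in enumerate(instrs):
        for n2, d2, s2, p2 in instrs[i + 1:]:
            if frozenset((p1, p2)) in exclusive:
                continue
            if d1 in s2:
                edges.append((n1, n2, "flow"))     # later reads what earlier wrote
            if d2 in s1:
                edges.append((n1, n2, "anti"))     # later overwrites what earlier read
            if d1 == d2:
                edges.append((n1, n2, "output"))   # both write the same register
    return edges
```

When p2 and p3 are declared exclusive, only the flow edges inside each predicated pair remain, so the two pairs can be scheduled in parallel.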

Predicate-Specific Optimizations

- Predicate Promotion
- Branch Combining
- Predicate Loop Peeling

Predicate Promotion

The idea is to speculate the execution of instructions by replacing their predicate with a less constrained ancestor predicate. Because the ancestor predicate is computed with fewer conditions, the execution of the promoted instruction is speculative. The advantage of predicate promotion is the reduction of the dependence chain in a hyperblock.

Conditions for Simple Predicate Promotion

The predicate of an instruction op(x) can be promoted to its predecessor predicate if all the following conditions are true:

1. op(x) is predicated
2. op(x) has a destination register
3. op(x) has a speculative version
4. there is a unique op(y) lexically before op(x) such that dest(y) = pred(x)
5. dest(x) is not live at op(y)
6. for any op(j) such that there is a path op(j)…op(y), dest(x) ≠ dest(j)
7. it is profitable to promote op(x)

Example of Predicate Promotion (qsort)

Before promotion:

 1  LA: ld_i  r20, r24, r101
 2      ld_i  r23, r2, r102
 3      pge   p126(U), p127(/U), r20, r23
 4  LB: ld_i  r6, r123, 0 (p126)
 5      add   r123, r123, 8 (p126)
 6      add   r9, r9, 1 (p126)
 7      add   r101, r101, 8 (p126)
 8  LC: ld_i  r6, r124, 8 (p127)
 9      add   r124, r124, 8 (p127)
10      add   r124, r124, 8 (p127)
11      add   r102, r102, 8 (p127)
12  LD: st_i  r114, 0, r23
13      st_i  r114, 4, r6
14      add   r7, r7, 1
15      add   r114, r114, 8
16      bge   r9, r3, EXIT
17  LE: blt   r8, r1, LA

After promotion:

 1  LA: ld_i  r20, r24, r101
 2      ld_i  r23, r2, r102
 3      pge   p126(U), p127(/U), r20, r23
 4  LB: ld_i  r6, r123, 0
 5      add   r123, r123, 8 (p126)
 6      add   r9, r9, 1 (p126)
 7      add   r101, r101, 8 (p126)
 8  LC: ld_i  r60, r124, 8
 8a     mov   r6, r60 (p127)
 9      add   r124, r124, 8 (p127)
10      add   r124, r124, 8 (p127)
11      add   r102, r102, 8 (p127)
12  LD: st_i  r114, 0, r23
13      st_i  r114, 4, r6
14      add   r7, r7, 1
15      add   r114, r114, 8
16      bge   r9, r3, EXIT
17  LE: blt   r8, r1, LA

The loads in instructions 4 and 8 lose their predicates and execute speculatively. The second load is renamed to write r60, and the new instruction 8a copies r60 into r6 under p127.

Branch Combining

Problem: too many infrequently executed branches in a hyperblock.

Example: a loop in grep

A:  bge   r1, r5, EXIT1
    ld_c  r3, r1, 0
    beq   r3, 10, EXIT2
    beq   r3, 0, EXIT3
    bge   r2, r6, EXIT4
    st_c  r2, 0, r3
    add   r1, r1, 1
    add   r2, r2, 1
    jmp   A

Annotations on the slide: 14, 4035, 0, 0.

Branch Combining

Solution: replace a group of exit branches with a corresponding group of predicate define instructions. All predicate definitions write into the same predicate register using the OR-type semantics. The resulting predicate is set to 1 if any of the exit branches would have been taken. Because not exiting the hyperblock is the most common case, the predicate is usually false.
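The control-flow effect of the transformation can be sketched as follows. This is a simplified model under stated assumptions: the exit conditions are passed in as already-evaluated Booleans (in real code they depend on instructions between the branches, which must be speculated), and the re-test that decides which exit to take is written inline instead of in a separate decode block. Function names are illustrative.

```python
def run_original(exit_conditions):
    """Index of the first exit branch taken, or None if the loop continues."""
    for i, taken in enumerate(exit_conditions):
        if taken:
            return i
    return None


def run_combined(exit_conditions):
    """Same result using one OR-type predicate and a single combined branch."""
    p = 0                                # predicate cleared before use
    for taken in exit_conditions:
        if taken:
            p = 1                        # OR-type define: only ever sets p to 1
    if not p:
        return None                      # common case: one not-taken branch
    # Rare case: re-test the original branches, in order, to pick the exit.
    for i, taken in enumerate(exit_conditions):
        if taken:
            return i
    raise AssertionError("p was set, so some exit must be taken")
```

In the common case the combined version executes a single branch instead of one branch per exit, which is the point of the optimization.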


Instructions Between Combined Branches

Instructions between combined branches are speculated. Instructions that lie between combined branches but cannot be speculated must be handled as follows:
(1) move the instructions below the combined exit branch in the hyperblock;
(2) replicate these instructions in the decode block, in their original position with respect to the exit branches.

Backend Compilation with Hyperblocks

[Figure: backend compilation flow. Phase boxes: Lcode generation, Classical Optim., Hyperblock/Superblock Formation, ILP/Predicate-Specific Optimizations, Classical Optim., Instruction Scheduling, Register Allocation. Supporting components: PHG, CFG Generator, Equation Solver. Labels: predicate relations, dataflow information, predicate aware.]
