Advanced Computer Architecture 5MD00 / 5Z033 Exploiting ILP with SW approaches
Henk Corporaal, TU Eindhoven, 2007
Topics
- Static branch prediction and speculation
- Basic compiler techniques
- Multiple issue architectures
- Advanced compiler support techniques
- Loop-level parallelism
- Software pipelining
- Hardware support for compile-time scheduling
- EPIC: IA-64
We need Static Branch Prediction
We previously discussed dynamic branch prediction. Dynamic prediction does not help the compiler, which must schedule code before run time, so we also need static branch prediction.
Static Branch Prediction and Speculation
Static branch prediction is useful for code scheduling. Example:

      ld   r1,0(r2)
      sub  r1,r1,r3   # hazard
      beqz r1,L
      or   r4,r5,r6
      addu r10,r4,r3
L:    addu r7,r8,r9

If the branch is taken most of the time, then since r7 is not needed on the fall-through path, we could move addu r7,r8,r9 directly after the ld. If the branch is not taken most of the time, then assuming r4 is not needed on the taken path, we could move or r4,r5,r6 directly after the ld.
Static Branch Prediction Methods
- Always predict taken: average misprediction rate for SPEC is 34% (9%-59%)
- Backward branches predicted taken, forward branches predicted not taken: in SPEC, however, most forward branches are taken, so always predicting taken is better
- Profiling: run the program and profile all branches; if a branch is taken (not taken) most of the time, it is predicted taken (not taken). The behavior of a branch is often biased towards taken or not taken. Average misprediction rate for SPECint: 15% (11%-22%), SPECfp: 9% (5%-15%)
- More advanced control-flow restructuring
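The second and third methods fit in a few lines of code; a minimal C sketch (function names and interfaces are ours, not from the slides):

    /* Backward-taken/forward-not-taken heuristic: backward branches
       usually close loops, so predict them taken. */
    int predict_btfnt(unsigned long branch_pc, unsigned long target_pc) {
        return target_pc < branch_pc;     /* 1 = predict taken */
    }

    /* Profile-based prediction: predict the direction the branch
       took most often in a training run. */
    int predict_profile(unsigned long taken, unsigned long not_taken) {
        return taken >= not_taken;        /* 1 = predict taken */
    }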
Basic compiler techniques
- Dependences limit ILP
- Stalls
- Scheduling to avoid stalls
- Loop unrolling: more parallelism
Dependencies Limit ILP
C loop:

    for (i=1; i<=1000; i++)
        x[i] = x[i] + s;

MIPS assembly code:

          ; R1 = &x[1]
          ; R2 = &x[1000]+8
          ; F2 = s
    Loop: L.D   F0,0(R1)     ; F0 = x[i]
          ADD.D F4,F0,F2     ; F4 = x[i]+s
          S.D   0(R1),F4     ; x[i] = F4
          ADDI  R1,R1,8      ; R1 = &x[i+1]
          BNE   R1,R2,Loop   ; branch if R1 != &x[1000]+8
Schedule this on a MIPS Pipeline
FP operations are mostly multicycle. The pipeline must be stalled if an instruction uses the result of a not-yet-finished multicycle operation. We assume the following latencies:

    Producing instruction   Consuming instruction   Latency (clock cycles)
    FP ALU op               FP ALU op               3
    FP ALU op               Store double            2
    Load double             FP ALU op               1
    Load double             Store double            0
Where to Insert Stalls
How would this loop be executed on the MIPS FP pipeline? With the latencies above it takes 10 cycles per iteration:

    Loop: L.D   F0,0(R1)    ; 1
          stall             ; 2
          ADD.D F4,F0,F2    ; 3
          stall             ; 4
          stall             ; 5
          S.D   0(R1),F4    ; 6
          ADDI  R1,R1,8     ; 7
          stall             ; 8
          BNE   R1,R2,Loop  ; 9
          stall             ; 10 (unfilled branch-delay slot)
Code Scheduling to Avoid Stalls
Can we reorder the instructions to avoid stalls? Yes; execution time is reduced from 10 to 6 cycles per iteration. But only 3 instructions perform useful work, the rest is loop overhead:

    Loop: L.D   F0,0(R1)    ; 1
          ADDI  R1,R1,8     ; 2
          ADD.D F4,F0,F2    ; 3
          stall             ; 4
          BNE   R1,R2,Loop  ; 5
          S.D   -8(R1),F4   ; 6 (fills the branch-delay slot)
Loop Unrolling: increasing ILP
At source level:

    for (i=1; i<=1000; i++)
        x[i] = x[i] + s;

becomes

    for (i=1; i<=1000; i=i+4) {
        x[i]   = x[i]  +s;
        x[i+1] = x[i+1]+s;
        x[i+2] = x[i+2]+s;
        x[i+3] = x[i+3]+s;
    }

MIPS code after unrolling and scheduling:

    Loop: L.D   F0,0(R1)
          L.D   F6,8(R1)
          L.D   F10,16(R1)
          L.D   F14,24(R1)
          ADD.D F4,F0,F2
          ADD.D F8,F6,F2
          ADD.D F12,F10,F2
          ADD.D F16,F14,F2
          S.D   0(R1),F4
          S.D   8(R1),F8
          ADDI  R1,R1,32
          S.D   -16(R1),F12
          BNE   R1,R2,Loop
          S.D   -8(R1),F16   ; fills the branch-delay slot

Drawbacks:
- loop unrolling increases code size
- more registers are needed
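Unrolling by 4 works directly here because 1000 is a multiple of 4; for a general trip count a remainder loop is needed. A hedged C sketch (the function name and signature are illustrative, x[1..n] indexed as in the slides):

    /* Unroll-by-4 with a cleanup loop for trip counts that are
       not a multiple of 4; x must have at least n+1 elements. */
    void add_scalar(double *x, double s, int n) {
        int i;
        for (i = 1; i + 3 <= n; i += 4) {   /* unrolled steady state */
            x[i]   += s;
            x[i+1] += s;
            x[i+2] += s;
            x[i+3] += s;
        }
        for (; i <= n; i++)                 /* remainder iterations */
            x[i] += s;
    }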
Multiple issue architectures
How to get CPI < 1?
- Superscalar: statically scheduled, or dynamically scheduled (see previous lecture)
- VLIW?
- SIMD / Vector?
Multiple-Issue Processors
Vector processing: explicit coding of independent loops as operations on large vectors of numbers; multimedia instructions are being added to many processors.
Multiple-issue processors:
- Superscalar: varying number of instructions per cycle (1 to 8), scheduled by the compiler or by HW (Tomasulo); dynamic issue capability. Examples: IBM PowerPC, Sun UltraSparc, DEC Alpha, Pentium III/4
- VLIW (very long instruction word): fixed number of instructions (4-16) scheduled by the compiler; static issue capability. Examples: Intel Architecture-64 (IA-64, Itanium), TriMedia, TI C6x
The anticipated success of multiple issue led to use of the Instructions Per Cycle (IPC) metric instead of CPI.
Statically Scheduled Superscalar
Static superscalar 2-issue processor: 1 integer and 1 FP instruction per cycle.
- Fetch 64 bits/clock cycle; Int on left, FP on right
- Can only issue the 2nd instruction if the 1st instruction issues
- More ports needed on the FP register file to execute an FP load and an FP op in parallel

    Type              Pipe stages
    Int. instruction  IF  ID  EX  MEM WB
    FP instruction    IF  ID  EX  MEM WB
    Int. instruction      IF  ID  EX  MEM WB
    FP instruction        IF  ID  EX  MEM WB
    Int. instruction          IF  ID  EX  MEM WB
    FP instruction            IF  ID  EX  MEM WB

A 1-cycle load delay impacts the next 3 instructions!
Example

    for (i=1; i<=1000; i++)
        a[i] = a[i]+s;

        Integer instruction   FP instruction     Cycle
    L:  L.D  F0,0(R1)                            1
        L.D  F6,8(R1)                            2
        L.D  F10,16(R1)       ADD.D F4,F0,F2     3
        L.D  F14,24(R1)       ADD.D F8,F6,F2     4
        L.D  F18,32(R1)       ADD.D F12,F10,F2   5
        S.D  0(R1),F4         ADD.D F16,F14,F2   6
        S.D  8(R1),F8         ADD.D F20,F18,F2   7
        S.D  16(R1),F12                          8
        ADDI R1,R1,40                            9
        S.D  -16(R1),F16                         10
        BNE  R1,R2,L                             11
        S.D  -8(R1),F20                          12

Load: 1 cycle latency; FP ALU op: 2 cycles latency. Unrolled 5 times, this takes 12 cycles, i.e. 2.4 cycles per element, vs. 3.5 for the ordinary MIPS pipeline. Note that the Int and FP instructions are not perfectly balanced.
Multiple Issue Issues
- While the integer/FP split is simple for the HW, it reaches an IPC of 2 (CPI of 0.5) only for programs with exactly 50% FP operations AND no hazards
- More complex decode and issue: even a 2-issue superscalar must examine 2 opcodes and 6 register specifiers to decide whether 1 or 2 instructions can issue (N-issue requires ~O(N^2) comparisons)
- Register file: each instruction needs up to 2 reads and 1 write, so a 2-issue superscalar needs double the ports per cycle
- Rename logic: must be able to rename the same register multiple times in one cycle! For instance, consider 4-way issue:

      add r1, r2, r3    =>   add p11, p4,  p7
      sub r4, r1, r2    =>   sub p22, p11, p4
      lw  r1, 4(r4)     =>   lw  p23, 4(p22)
      add r5, r1, r2    =>   add p12, p23, p4

  Imagine doing this transformation in a single cycle!
- Result buses: multiple instructions must complete per cycle, so multiple buses are needed, with associated matching logic at every reservation station
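To make the issue-check cost concrete, here is a minimal C sketch (types and names are ours) of the pairwise register-specifier comparisons a 2-issue in-order checker performs; with N-issue every later instruction must be checked against every earlier one in the group, hence the ~O(N^2) growth:

    #include <stdbool.h>

    typedef struct { int dst, src1, src2; } Insn;  /* register specifiers */

    /* Can i1 issue in the same cycle as i0 on an in-order 2-issue machine? */
    bool can_dual_issue(Insn i0, Insn i1) {
        if (i1.src1 == i0.dst || i1.src2 == i0.dst) return false;  /* RAW */
        if (i1.dst == i0.dst)                       return false;  /* WAW */
        return true;
    }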
VLIW Processors
Superscalar HW is too difficult to build => let the compiler find independent instructions and pack them in one Very Long Instruction Word (VLIW).
Example: VLIW processor with 2 ld/st units, 2 FP units, 1 integer/branch unit, no branch delay:

    Cycle  Ld/st 1          Ld/st 2          FP 1              FP 2              Int
    1      L.D F0,0(R1)     L.D F6,8(R1)
    2      L.D F10,16(R1)   L.D F14,24(R1)
    3      L.D F18,32(R1)   L.D F22,40(R1)   ADD.D F4,F0,F2    ADD.D F8,F6,F2
    4      L.D F26,48(R1)                    ADD.D F12,F10,F2  ADD.D F16,F14,F2
    5                                        ADD.D F20,F18,F2  ADD.D F24,F22,F2
    6      S.D 0(R1),F4     S.D 8(R1),F8     ADD.D F28,F26,F2
    7      S.D 16(R1),F12   S.D 24(R1),F16
    8      S.D 32(R1),F20   S.D 40(R1),F24                                      ADDI R1,R1,56
    9      S.D -8(R1),F28                                                       BNE R1,R2,L

7 iterations complete in 9 cycles, i.e. about 1.3 cycles per element, but many instruction slots stay empty.
Superscalar versus VLIW
VLIW advantages:
- much simpler to build; potentially faster
VLIW disadvantages, and proposed solutions:
- Binary code incompatibility
  - object code translation or emulation
  - a less strict approach (EPIC, IA-64, Itanium)
- Increase in code size; unfilled slots are wasted bits
  - use clever encodings, e.g. only one immediate field
  - compress instructions in memory and decode them when they are fetched
- Lockstep operation: if the operation in one instruction slot stalls, the entire processor is stalled
  - a less strict approach
Advanced compiler support techniques
- Loop-level parallelism
- Software pipelining
- Global scheduling (across basic blocks)
Detecting Loop-Level Parallelism
Loop-carried dependence: a statement executed in a certain iteration depends on a statement executed in an earlier iteration. If there is no loop-carried dependence, the iterations can be executed in parallel.

    for (i=1; i<=100; i++) {
        A[i+1] = A[i]+C[i];    /* S1 */
        B[i+1] = B[i]+A[i+1];  /* S2 */
    }

[dependence graph: A gives S1 a loop-carried dependence on itself, B gives S2 one on itself, and A[i+1] makes S2 depend on S1 within an iteration]

A loop is parallel if and only if the corresponding dependence graph does not contain a cycle.
Finding Dependences
Is there a dependence in the following loop?

    for (i=1; i<=100; i++)
        A[2*i+3] = A[2*i] + 5.0;

Affine expression: an expression of the form a*i + b (a, b constants, i the loop index variable). The question is whether the following equation has a solution:

    a*i + b = c*j + d

GCD test: if there is a solution, then GCD(a,c) must divide d-b. For this loop a=2, b=3, c=2, d=0; GCD(2,2)=2 does not divide d-b=-3, so there is no dependence.
Note: because the GCD test does not take the loop bounds into account, there are cases where the GCD test says "yes, there is a solution" while in reality there isn't.
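A minimal C sketch of the test (function names are ours, not from the slides):

    #include <stdio.h>
    #include <stdlib.h>

    /* greatest common divisor, Euclid's algorithm */
    static int gcd(int a, int b) {
        while (b != 0) { int t = a % b; a = b; b = t; }
        return a;
    }

    /* GCD test for a*i + b == c*j + d: returns 1 if a dependence is
       possible (conservative: loop bounds are ignored). */
    static int may_depend(int a, int b, int c, int d) {
        return (d - b) % gcd(abs(a), abs(c)) == 0;
    }

    int main(void) {
        /* A[2*i+3] = A[2*i] + 5.0  =>  a=2, b=3, c=2, d=0 */
        printf(may_depend(2, 3, 2, 0) ? "maybe dependent\n" : "independent\n");
        return 0;
    }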
Software Pipelining
We have already seen loop unrolling. Software pipelining is a related technique that consumes less code space: it interleaves instructions from different iterations, since instructions within one iteration are often dependent on each other.

[figure: iterations 0, 1, 2 overlap in time; a software-pipelined iteration picks instructions from different original iterations, forming a steady-state kernel]
Simple Software Pipelining Example

    L: l.d   f0,0(r1)    # load M[i]
       add.d f4,f0,f2    # compute M[i]
       s.d   f4,0(r1)    # store M[i]
       addi  r1,r1,-8    # i = i-1
       bne   r1,r2,L

Software-pipelined loop:

    L: s.d   f4,16(r1)   # store M[i]
       add.d f4,f0,f2    # compute M[i-1]
       l.d   f0,0(r1)    # load M[i-2]
       addi  r1,r1,-8
       bne   r1,r2,L

Hardware support is needed to avoid the WAR hazards.
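At source level the same idea looks as follows; a hedged C sketch (variable names ours) with explicit prologue, steady-state kernel, and epilogue, assuming at least two iterations:

    /* Software-pipelined x[i] += s over x[0..n-1], n >= 2.
       The kernel stores iteration i-2, computes i-1, and loads i. */
    void sw_pipelined(double *x, double s, int n) {
        double f0, f4;
        f0 = x[0];                 /* prologue: load iteration 0 */
        f4 = f0 + s;               /* prologue: compute iteration 0 ... */
        f0 = x[1];                 /* ... and load iteration 1 */
        for (int i = 2; i < n; i++) {
            x[i-2] = f4;           /* store result of iteration i-2 */
            f4 = f0 + s;           /* compute iteration i-1 */
            f0 = x[i];             /* load iteration i */
        }
        x[n-2] = f4;               /* epilogue: drain the pipeline */
        f4 = f0 + s;
        x[n-1] = f4;
    }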
Global code scheduling
Loop unrolling and software pipelining work well when there are no control statements (if statements) in the loop body, i.e. when the loop body is a single basic block. Global code scheduling means scheduling/moving code across branches: a larger scheduling scope. When can the assignments to B and C be moved before the test?

[flow graph: A[i]=A[i]+B[i], then test A[i]==0?; on the taken path B[i]=..., after which both paths join at C[i]=...]
Which scheduling scope?
- Trace
- Superblock
- Decision tree
- Hyperblock/region
Comparing scheduling scopes
Why choose a specific scope? The first four below are all acyclic, i.e. they contain no back edges.
- Trace (Fisher, IEEE Trans. on Computers, 1981): use standard list scheduling; however, lots of bookkeeping and code copying is needed for code motion past fork and join points.
- Superblock (Hwu et al., Journal of Supercomputing, May 1993): easier than trace scheduling: no join points, so no copying during scheduling; only upward code motion, so no motion past forks. Tail duplication is needed.
- Decision tree: follows multiple paths; no join points, so no complex bookkeeping; no incoming edges, so no code duplication during scheduling; each block with multiple entries becomes a root, so trees are small, hence tail duplication is needed.
- Hyperblock (Warter et al., PLDI, June 1993): a superblock with multiple paths if-converted; single entry; re-if-conversion is needed for architectures without guarded execution.
- Region (Bernstein and Rodeh, PLDI, 1991): corresponds to the bodies of natural loops; regions can be nested (hierarchical scheduling); no profiling is needed for region selection (in contrast to the former scopes); very large scope (encompasses the other approaches).
- Loop: keep multiple iterations active at a time; different approaches are discussed later on.

Disadvantages of trace and superblock scheduling: they follow only one path, and they require a high completion ratio (if the first block is executed, all blocks should have a high probability of being executed), i.e. biased branches and accurate static branch prediction.

[table (cell entries not recoverable): compares Trace, Superblock, Hyperblock, Decision Tree, and Region on: multiple execution paths, side-entries allowed, join points allowed, code motion down joins, must be if-convertible, tail duplication before scheduling]
Scheduling scope creation
Partitioning a CFG into scheduling scopes:

[figure: a CFG with blocks A-G; on the left it is partitioned into a trace; on the right, tail duplication creates copies E', D', G' to form superblocks]
Scheduling scope creation
Partitioning a CFG into scheduling scopes:

[figure: the same CFG partitioned into a decision tree, where tail duplication creates copies E', F', D', G', G''; and into a hyperblock/region covering A-G]
Trace Scheduling
- Find the most likely sequence of basic blocks that will be executed consecutively (trace selection)
- Optimize the trace as much as possible (trace compaction):
  - move operations as early as possible in the trace
  - pack the operations in as few VLIWs as possible
  - additional bookkeeping code may be necessary at exit points of the trace
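As an illustration of trace selection, a hedged C sketch (data structures ours, not from the slides) that grows a trace from a seed block by always following the most frequently executed successor edge:

    #include <stddef.h>

    typedef struct Block Block;
    struct Block {
        Block *succ[2];    /* successor blocks, NULL if absent */
        double prob[2];    /* profiled probability of each edge */
        int in_trace;      /* already placed in some trace? */
    };

    /* Grow a trace from seed along the most likely edges; returns its length. */
    int select_trace(Block *seed, Block **trace, int max_len) {
        int len = 0;
        Block *b = seed;
        while (b != NULL && !b->in_trace && len < max_len) {
            b->in_trace = 1;
            trace[len++] = b;
            /* follow the more probable successor edge */
            if (b->succ[0] && (!b->succ[1] || b->prob[0] >= b->prob[1]))
                b = b->succ[0];
            else
                b = b->succ[1];
        }
        return len;
    }

A real trace scheduler would additionally pick seeds in order of profile count and stop at loop back edges; this sketch shows only the greedy edge-following step.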
Hardware support for compile-time scheduling
- Predication (discussed already)
- Deferred exceptions
- Speculative loads
Predicated Instructions
Avoid branch prediction by turning branches into conditional or predicated instructions: if the condition is false, the instruction neither stores its result nor causes an exception.
- The expanded ISAs of Alpha, MIPS, PowerPC, and SPARC have a conditional move; PA-RISC can annul any following instruction
- IA-64/Itanium: conditional execution of any instruction

Examples:

    if (R1==0) R2 = R3;        CMOVZ  R2,R3,R1

    if (R1 < R2)               SLT    R9,R1,R2
        R3 = R1;               CMOVNZ R3,R1,R9
    else                       CMOVZ  R3,R2,R9
        R3 = R2;
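The second example written as branch-free C; a small sketch (function name ours) of the if-converted form that a compiler can map onto the SLT/CMOV sequence above:

    /* Branch-free selection: the predicate and both values are
       computed, then a conditional move picks the result. */
    int min_ifconverted(int r1, int r2) {
        int p = (r1 < r2);    /* SLT  R9,R1,R2  */
        return p ? r1 : r2;   /* CMOVNZ / CMOVZ */
    }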
Deferred Exceptions

    if (A==0) A = B; else A = A+4;

Non-speculative code:

        ld   r1,0(r3)   # load A
        bnez r1,L1      # test A
        ld   r1,0(r2)   # then part; load B
        j    L2
    L1: addi r1,r1,4    # else part; inc A
    L2: st   r1,0(r3)   # store A

Assuming the then-part is almost always executed, the load of B can be hoisted above the test:

        ld   r1,0(r3)   # load A
        ld   r9,0(r2)   # speculative load of B
        beqz r1,L3      # test A
        addi r9,r1,4    # else part
    L3: st   r9,0(r3)   # store A

What if the speculative load generates a page fault? What if it generates an "index-out-of-bounds" exception?
HW supporting Speculative Loads
- Speculative load (sld): does not generate exceptions
- Speculation check instruction (speck): checks for a deferred exception; the exception occurs when this instruction is executed

        ld    r1,0(r3)   # load A
        sld   r9,0(r2)   # speculative load of B
        bnez  r1,L1      # test A
        speck 0(r2)      # perform exception check
        j     L2
    L1: addi  r9,r1,4    # else part
    L2: st    r9,0(r3)   # store A
Avoiding superscalar complexity
An alternative: EPIC (Explicitly Parallel Instruction Computing). Best of both worlds?
- Superscalar: expensive, but binary compatible
- VLIW: simple, but not compatible
EPIC Architecture: IA-64
Explicitly Parallel Instruction Computing. IA-64 implementations: Merced (2001), McKinley (2002), Montecito (2-core, 2006), Tukwila (4-core, 2008); the architecture is now called Itanium.
Register model:
- 128 integer registers, 64(+1) bits each, with register stack and rotation
- 128 floating-point registers, 82 bits each, rotating
- 64 boolean (predicate) registers, 1 bit each
- 8 branch-target-address registers, 64 bits each
- system control registers
EPIC Architecture: IA-64
Instructions are grouped in 128-bit bundles:
- 3 x 41-bit instructions
- 5 template bits, which indicate the instruction types and stop locations
Each 41-bit instruction starts with a 4-bit opcode and ends with a 6-bit guard (boolean) register id.
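To make the encoding concrete (3 x 41 + 5 = 128 bits), a hedged C sketch that unpacks a bundle; it assumes the little-endian field layout (template in bits 0-4, the three 41-bit slots above it) and is ours, not from the slides:

    #include <stdint.h>

    typedef struct { uint64_t lo, hi; } Bundle;   /* 128 bits as two halves */

    #define SLOT_MASK ((1ULL << 41) - 1)

    unsigned template_of(const Bundle *b) {
        return (unsigned)(b->lo & 0x1f);          /* bits 0-4 */
    }

    uint64_t slot_of(const Bundle *b, int i) {
        switch (i) {
        case 0:  return (b->lo >> 5) & SLOT_MASK;                    /* bits 5-45 */
        case 1:  return ((b->lo >> 46) | (b->hi << 18)) & SLOT_MASK; /* bits 46-86 */
        default: return (b->hi >> 23) & SLOT_MASK;                   /* bits 87-127 */
        }
    }

    /* Within a 41-bit instruction: major opcode in the top 4 bits,
       guard (predicate) register id in the low 6 bits. */
    unsigned opcode_of(uint64_t insn) { return (unsigned)(insn >> 37); }
    unsigned guard_of(uint64_t insn)  { return (unsigned)(insn & 0x3f); }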
EPIC Architecture: IA-64
IA-64 looks like a VLIW. However:
- instructions contain only one operation; the compiler can indicate that successive instructions may be executed in parallel
- the HW does the operation-to-FU binding
- pipeline latencies are not visible in the ISA
These measures make the ISA independent of the number of FUs and of pipeline latencies, so the ISA supports multiple implementations.
[figure: Montecito (2006)]