CS 211: Computer Architecture Lecture 6 Module 2 Exploiting Instruction Level Parallelism with Software Approaches Instructor: Morris Lancaster.


1 CS 211: Computer Architecture Lecture 6 Module 2 Exploiting Instruction Level Parallelism with Software Approaches Instructor: Morris Lancaster

2 Global Code Scheduling
5/30/3008 CS 211 Lecture 6
Used where the body of an unrolled loop contains internal control flow
–Will require moving instructions across branches
The objective is to compact a code fragment with internal control structure into the shortest possible sequence that preserves the data and control dependences.
–Data dependences impose a partial ordering on the instructions
–Control dependences dictate which code cannot be moved across control boundaries

3 Figure 4.8: A Common Control Flow

4 Global Scheduling
The goal is to move the assignments to B and C earlier in the execution sequence, before the test of A.
–These movements must change neither the data flow nor the exception behavior
Exception behavior: compilers will not move certain classes of instructions, such as memory references, that can cause exceptions
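The code motion described on this slide can be sketched in Python. This is only an illustration: the right-hand sides of the assignments and the contents of the else block (X in the figure) are made up, and the temporary `b_spec` stands in for register renaming so that the old value of B is still intact if the test fails.

```python
def original(a, b, c):
    """Control flow shaped like Figure 4.8; the right-hand sides are hypothetical."""
    a = a + b
    if a == 0:              # test of A
        b = a * 2 + 1       # assignment to B on the 'then' path
    else:
        a = a - 1           # stands in for block X
    c = c * 3               # assignment to C after the join
    return a, b, c


def scheduled(a, b, c):
    """Same computation with the assignments to B and C moved above the test."""
    a = a + b
    b_spec = a * 2 + 1      # hoisted; written to a temporary (renaming)
    c = c * 3               # hoisted; legal here because it depends on neither path
    if a == 0:
        b = b_spec          # commit the speculated value only on the 'then' path
    else:
        a = a - 1
    return a, b, c


for inputs in [(0, 0, 1), (2, -2, 5), (3, 4, 7)]:
    assert original(*inputs) == scheduled(*inputs)
```

The hoisted operations here cannot raise exceptions, which is what makes the move safe without hardware help; a hoisted load or divide would need the support discussed later in the lecture.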

5 Global Scheduling (continued)
Can be extremely complex
–Moving instructions can have an impact in many areas
–Deciding how far to move the code can be complex
–Trade-offs can occur that are difficult to program for
–Selecting the candidate instruction to move is difficult
Candidate methods
–Trace scheduling: focuses on the critical path
–Superblocks

6 Trace Scheduling
Useful for processors with a large number of issues per clock, where simple loop unrolling may not uncover enough ILP to keep the processor busy.
Incurs the costs of moving code on the less frequent paths
–Best used where profile information indicates significant differences in frequency between different paths and where the profile is correct independent of the particular inputs to the program
–Only certain classes of programs meet these criteria

7 Trace Scheduling (Two Steps)
Trace selection
–Tries to find a likely sequence of basic blocks whose operations will be put together into a smaller number of instructions; this sequence is called a trace
–Loop unrolling generates long traces, since loop branches are taken with high probability
–Other branches are predicted statically by the compiler
–Figure 4.9 shows a loop unrolled 4 times
Trace compaction
–Squeezes the trace into a smaller number of wide instructions
–Code scheduling: move operations as early as possible in the trace
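Trace selection can be sketched as a greedy walk over a control-flow graph annotated with branch probabilities. The graph encoding, the block names, and the function `select_trace` are assumptions made for illustration; a production trace scheduler would use more elaborate heuristics than this.

```python
def select_trace(cfg, entry):
    """Follow the most probable successor from `entry` until the path ends.

    cfg maps a block name to a list of (successor, probability) pairs.
    """
    trace = [entry]
    seen = {entry}
    block = entry
    while cfg.get(block):
        successor, _ = max(cfg[block], key=lambda edge: edge[1])
        if successor in seen:       # stop at a back edge instead of looping
            break
        trace.append(successor)
        seen.add(successor)
        block = successor
    return trace


# Hypothetical profile: the 'then' side of the test is taken 90% of the time.
EXAMPLE_CFG = {
    "entry": [("test", 1.0)],
    "test": [("then", 0.9), ("else", 0.1)],
    "then": [("join", 1.0)],
    "else": [("join", 1.0)],
    "join": [],
}

print(select_trace(EXAMPLE_CFG, "entry"))
```

The block "else" is left off the trace; it becomes one of the less frequent paths that absorbs the compensation code.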

8 Unrolled inner loop

9 Trace Scheduling (continued)
Branches are viewed as jumps into or out of the selected trace, which is assumed to be the most probable path.
When code is moved across trace entry and exit points, additional code is generated so that the alternate paths still compute correct results
An alternate code sequence is also generated

10 Superblocks
In trace scheduling, entries into and exits from the middle of the trace cause complications: the compiler must generate much additional code to compensate for incorrect branch speculation
Superblocks are formed similarly to traces but are a form of extended basic block, restricted to a single entry while allowing multiple exits.
–Only code movement across an exit generates an alternative code set.
–Generated by using tail duplication, as in Figure 4.10
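Tail duplication can be illustrated on a small control-flow graph. The sketch below assumes an acyclic graph stored as a dictionary of successor lists and assumes side entrances come only from blocks that are off the trace; the function name `form_superblock` and the primed names for the copies are hypothetical.

```python
def form_superblock(cfg, trace):
    """Remove side entrances into `trace` by duplicating its tail.

    cfg maps a block name to a list of successor names. A new graph is
    returned in which off-trace blocks branch to copies of the tail.
    """
    on_trace = set(trace)
    new_cfg = {block: list(succs) for block, succs in cfg.items()}

    def has_side_entrance(block):
        return any(block in succs
                   for pred, succs in cfg.items() if pred not in on_trace)

    first = next((i for i in range(1, len(trace))
                  if has_side_entrance(trace[i])), None)
    if first is None:
        return new_cfg                      # already single-entry

    tail = trace[first:]
    copy = {block: block + "'" for block in tail}
    for block in tail:                      # duplicate the tail blocks
        new_cfg[copy[block]] = [copy.get(s, s) for s in cfg[block]]
    for pred in cfg:                        # redirect the side entrances
        if pred not in on_trace:
            new_cfg[pred] = [copy.get(s, s) for s in cfg[pred]]
    return new_cfg


EXAMPLE_CFG = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}
print(form_superblock(EXAMPLE_CFG, ["A", "B", "D", "E"]))
```

After the transformation, block C jumps to the copy D' rather than into the middle of the trace A, B, D, E, so the trace has a single entry at A and only exits remain.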

11 Figure 4.10 (figure labels: superblock exits with n=4, n=3, n=2, and n=1; "Execute n times")

12 Hardware Support for Compile Time Exposure of Parallelism
Loop unrolling, software pipelining, and trace scheduling work well only if the behavior of branches is predictable at compile time.
–Most predictable are loop branches (backward branches)
–Programming style (how if-then-else blocks are coded with respect to the if test) affects these methods
Techniques to overcome these limitations
–Conditional or predicated instructions
–Compiler speculation

13 Conditional or Predicated Instructions
An instruction refers to a condition, which is evaluated as part of the instruction's execution.
–If the condition is true, the instruction is executed
–If the condition is false, the instruction behaves as a NOP
–Many new architectures have these
Control dependence is converted to data dependence
The number of branches can be reduced
Difficult-to-predict branches can be eliminated
Conditional move limitations
–May need to introduce many conditional moves
Example, with A, S, and T held in R1, R2, and R3:
    if (A==0) {S=T;}
    ;normally compiles to
        BNEZ  R1,L
        ADDU  R2,R3,R0
    L:
    ;the code above is changed to
        CMOVZ R2,R3,R1
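To check that the branch sequence and the conditional move on this slide compute the same result, both can be run through a toy interpreter. The interpreter, its tuple encoding of instructions, and the `LABEL` pseudo-instruction are assumptions for illustration only; it models just the three instructions used here.

```python
def run(program, regs):
    """Execute a list of instruction tuples on a copy of the register dict."""
    regs = dict(regs)
    labels = {ins[1]: i for i, ins in enumerate(program) if ins[0] == "LABEL"}
    pc = 0
    while pc < len(program):
        op, *args = program[pc]
        if op == "BNEZ":                    # branch if register is not zero
            src, target = args
            if regs[src] != 0:
                pc = labels[target]
                continue
        elif op == "ADDU":                  # dst = a + b (32-bit wraparound)
            dst, a, b = args
            regs[dst] = (regs[a] + regs[b]) & 0xFFFFFFFF
        elif op == "CMOVZ":                 # dst = src only if cond is zero
            dst, src, cond = args
            if regs[cond] == 0:
                regs[dst] = regs[src]
        pc += 1                             # LABEL falls through as a no-op
    return regs


BRANCH_VERSION = [("BNEZ", "R1", "L"), ("ADDU", "R2", "R3", "R0"), ("LABEL", "L")]
CMOV_VERSION = [("CMOVZ", "R2", "R3", "R1")]

for a in (0, 5):
    start = {"R0": 0, "R1": a, "R2": 11, "R3": 22}
    assert run(BRANCH_VERSION, start) == run(CMOV_VERSION, start)
```

The conditional-move version has no branch to predict: whether R2 changes now depends on the value in R1 as data, which is the conversion of control dependence to data dependence named above.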

14 Conditional or Predicated Instructions (continued)
Some architectures support full predication, whereby the execution of all instructions is controlled by a predicate
–When the predicate is false, the instruction is a no-op
–Allows conversion of large blocks of branch-dependent code
–Can eliminate non-loop branches
–Allows code movement constrained only by the availability of the predicate data
Correct behavior
–Predicated code must not generate an exception when the predicate is false
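Full predication can be modeled as a straight-line list of guarded operations. The sketch below is a simplified model, not any real instruction set: each operation is a hypothetical (predicate, destination, compute) triple, and an operation whose predicate is false is counted as issued but has no effect and cannot fault.

```python
def run_predicated(ops, state):
    """Run if-converted code; return the final state and the issue count."""
    state = dict(state)
    issued = 0
    for predicate, dst, compute in ops:
        issued += 1                       # annulled ops still occupy a slot
        if state[predicate]:
            state[dst] = compute(state)   # evaluated only when the predicate holds
    return state, issued


# If-converted form of:  if (a != 0) q = x // a; else q = 0;
IF_CONVERTED = [
    ("true", "p", lambda s: s["a"] != 0),     # set predicate p
    ("true", "np", lambda s: s["a"] == 0),    # set complementary predicate
    ("p", "q", lambda s: s["x"] // s["a"]),   # 'then' side, guarded by p
    ("np", "q", lambda s: 0),                 # 'else' side, guarded by np
]

result, issued = run_predicated(IF_CONVERTED, {"true": True, "a": 0, "x": 10, "q": -1})
print(result["q"], issued)
```

With `a` equal to zero the guarded division is annulled, so no divide-by-zero exception is raised, matching the correctness requirement on this slide. All four operations are issued on every run, whichever side of the original branch applies.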

15 Constraints on Usefulness of Conditional or Predicated Instructions
Predicated instructions that are annulled still go through fetch and, in most processors, take some functional execution time
–Moving an instruction across a branch slows the program down whenever the moved instruction would not otherwise have been executed
Predicated instructions are most useful when the predicate can be evaluated early
The use of conditional instructions can be limited when the control flow involves more than a simple alternative sequence
–Moving an instruction across multiple branches makes it conditional on both branches
Conditional instructions may have some speed penalty compared with unconditional instructions
Many architectures include a few simple conditional instructions, but only a few (such as IA-64) include conditional versions of most instructions

16 Compiler Speculation with Hardware Support
Ambitious speculation requires
–The ability of the compiler to find instructions that, with possible use of register renaming, can be speculatively moved without affecting program data flow (a compiler responsibility, as seen earlier)
–The ability to ignore exceptions in speculated instructions until it is known whether the exceptions should occur (this is known only at run time, so it is the hardware's responsibility)
–The ability to speculatively interchange loads and stores, or stores and stores, which may have address conflicts (this is known only at run time, so it is the hardware's responsibility)
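The third requirement, interchanging a load with an earlier store that may touch the same address, can be illustrated with a run-time address check. This is a sketch of one possible scheme, assumed for illustration: the hardware is imagined to remember the address of the hoisted load and to redo the load if the store turns out to conflict. Memory is modeled as a Python dictionary.

```python
def run_original(memory, store_addr, store_val, load_addr):
    """Program order: the store executes first, then the load."""
    memory = dict(memory)
    memory[store_addr] = store_val
    return memory[load_addr]


def run_speculated(memory, store_addr, store_val, load_addr):
    """The load is hoisted above the store, with a conflict check afterwards."""
    memory = dict(memory)
    loaded = memory[load_addr]          # speculative load, moved above the store
    watched = load_addr                 # address remembered for the check
    memory[store_addr] = store_val
    if store_addr == watched:           # conflict is detectable only at run time
        loaded = memory[load_addr]      # fix-up: redo the load
    return loaded


MEMORY = {100: 1, 200: 2}
assert run_original(MEMORY, 100, 9, 200) == run_speculated(MEMORY, 100, 9, 200)
assert run_original(MEMORY, 200, 9, 200) == run_speculated(MEMORY, 200, 9, 200)
```

In the common case the addresses differ and the early load is kept; only when they match does the fix-up path run.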

17 Hardware Support for Preserving Exception Behavior
Four approaches
HW and OS cooperatively ignore exceptions for speculative instructions
–Preserves exception behavior for correct programs but not for incorrect ones
–Generally unacceptable but has been used; incorrect programs continue but get incorrect results
Speculative instructions that never raise exceptions are used, with checks introduced to determine when an exception should occur
A set of status bits is attached to the result registers; these "poison bits" (one bit per result register) are set by speculated instructions when they cause an exception
–Note that stores are never speculated
A mechanism is provided to indicate that an instruction is speculative, and the hardware buffers the instruction's result until it is certain that the instruction is no longer speculative.
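The poison-bit approach can be modeled with a small register file. The class and method names below are hypothetical, memory is a dictionary in which a missing address stands for a faulting access, and the rule that poison propagates through dependent speculative instructions is an assumption of this sketch.

```python
class DeferredFault(Exception):
    """Raised when a non-speculative instruction reads a poisoned register."""


class PoisonRegisterFile:
    def __init__(self):
        self.value = {}
        self.poison = {}

    def speculative_load(self, dst, memory, addr):
        if addr in memory:
            self.value[dst] = memory[addr]
            self.poison[dst] = False
        else:
            # The access would fault: record that instead of trapping now.
            self.value[dst] = 0
            self.poison[dst] = True

    def speculative_add(self, dst, src1, src2):
        self.value[dst] = self.value[src1] + self.value[src2]
        self.poison[dst] = self.poison[src1] or self.poison[src2]

    def use(self, src):
        """A normal (non-speculative) read; this is where the fault surfaces."""
        if self.poison[src]:
            raise DeferredFault(f"deferred exception surfaced via {src}")
        return self.value[src]


regs = PoisonRegisterFile()
memory = {100: 7}
regs.speculative_load("R1", memory, 100)   # valid address
regs.speculative_load("R2", memory, 999)   # would fault; R2 is poisoned
regs.speculative_add("R3", "R1", "R2")     # poison carries into R3
print(regs.use("R1"))
```

If the program never reads R2 or R3 non-speculatively, meaning the speculation was wrong, the fault never becomes visible. If it does read them, `use` raises at that point.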

18 Fallacies and Pitfalls
Fallacy: There is a simple approach to multiple-issue processors that yields high performance without a significant investment in silicon area or design complexity
–Significant effort has gone into finding this fountain of gold
–The observed result is a growing gap between peak and sustained performance as the issue rate increases
–Compiling for processors with significant amounts of ILP has become complex

19 Itanium (separate attachment)

20 SPECInt Benchmark

21 SPECfp: Comparing Itanium, Alpha 21264, and Pentium 4

