4Instruction Level Parallelism (ILP) Suppose we have an expression of the form x = (a+b) * (c-d)Assuming a,b,c & d are in registers, this might turn intoADD R0, R2, R3SUB R1, R4, R5MUL R0, R0, R1STR R0, x
5ILP (cont)The MUL has a dependence on the ADD and the SUB, and the STR has a dependence on the MULHowever, the ADD and SUB are independentIn theory, we could execute them in parallel, even out of orderADD R0, R2, R3SUB R1, R4, R5MUL R0, R0, R1STR R0, x
6The Data Flow GraphWe can see this more clearly if we draw the data flow graphR R3 R4 R5As long as R2, R3,R4 & R5 are available,We can execute theADD & SUB in parallelADDSUBMULx
7Amount of ILP? This is obviously a very simple example However, real programs often have quite a few independent instructions which could be executed in parallelExact number is clearly program dependent but analysis has shown that maybe 4 is not uncommon (in parts of the program anyway).
8How to Exploit?We need to fetch multiple instructions per cycle – wider instruction fetchNeed to decode multiple instructions per cycleBut must use common registers – they are logically the same registersNeed multiple ALUs for executionBut also access common data cache
9Dual Issue Pipeline Structure Two instructions can now execute in parallel(Potentially) double the execution rateCalled a ‘Superscalar’ architectureI I2InstructionCacheRegister BankMUXALUCacheDataPCMUXALU
10Register & Cache Access Note the access rate to both registers & cache will be doubledTo cope with this we may need a dual ported register bank & dual ported cache.This can be done either by duplicating access circuitry or even duplicating whole register & cache structure
11Selecting Instructions To get the doubled performance out of this structure, we need to have independent instructionsWe can have a ‘dispatch unit’ in the fetch stage which uses hardware to examine the instruction dependencies and only issue two in parallel if they are independent
12Instruction order If we had Issued in pairs as above ADD R1,R1,R0MUL R0,R1,R1ADD R3,R4,R5MUL R4,R3,R3Issued in pairs as aboveWe wouldn’t be able to issue any in parallel because of dependencies
13Compiler Optimisation But if the compiler had examined dependencies and producedADD R1,R1,R0ADD R3,R4,R5MUL R0,R1,R1MUL R4,R3,R3We can now execute pairs in parallel (assuming appropriate forwarding logic)
14Relying on the Compiler If compiler can’t manage to reorder the instructions, we still need hardware to avoid issuing conflictsBut if we could rely on the compiler, we could get rid of expensive checking logicThis is the principle of VLIW (Very Long Instruction Word)Compiler must add NOPs if necessary
15Out of Order ExecutionThere are arguments against relying on the compilerLegacy binaries – optimum code tied to a particular hardware configuration‘Code Bloat’ in VLIW – useless NOPsInstead rely on hardware to re-order instructions if necessaryComplex but effective
16Out of Order ExecutionProcessor Level OOE can only happen once the instructions are loaded.I I2InstructionCacheRegister BankMUXALUCacheDataPCMUXALU
17Programmer Assisted ILP / Vector Instructions Linear Algebra operations such as Vector Product, Matrix Multiplication have LOTS of parallelismThis can be hard to detect in languages like CInstructions can be too separated for hardware detection.Programmer can use types such as float4
18Limits of ILPModern processors are up to 4 way superscalar (but rarely achieve 4x speed)Not much beyond thisHardware complexityLimited amounts of ILP in real programsLimited ILP not surprising, conventional programs are written assuming a serial execution model – what next?