VLIW Machines Sima, Fountain and Kacsuk Chapter 6 CSE3304.


David Abramson, 2000. Material from Sima, Fountain and Kacsuk, Addison Wesley 1997.

VLIW Machines
• Single stream of instructions
  – one program counter and one control unit
• Very long instruction format
  – enough control bits to directly and independently control the action of every functional unit in every cycle

VLIW Machines...
• Large numbers of data paths and functional units
  – control is planned at compile time
  – some VLIW machines have no arbiters, queues or other hardware synchronisation mechanisms

Common traits between VLIW and Superscalar?
[Figure labels: Register File, Instructions, EU, Performance]

Common traits between VLIW and Superscalar?...
[Figure labels: FX Register File, FX Instructions, multiple FX EUs; FP Register File, FP Instructions, multiple FP EUs]

Differences between VLIW and Superscalar?...
[Figure labels: Cache Memory, Fetch Unit, Decode/Issue Unit, Register File, EU, Multiple Instructions]

Issuing 1 instruction per cycle
[Figure labels: Cache Memory, Fetch Unit, Decode/Issue Unit, Register File, EU, Multiple Instructions]

Issuing 2 instructions per cycle
[Figure labels: Cache Memory, Fetch Unit, Decode/Issue Unit, Register File, EU, Multiple Instructions]

Issuing 4 instructions per cycle
[Figure labels: Cache Memory, Fetch Unit, Decode/Issue Unit, Register File, EU, Multiple Instructions]
• The issue unit decides at run time how many instructions to issue!

In a VLIW machine the choice is static
[Figure labels: Cache Memory, Fetch Unit, Register File, EU, Single Long Instruction]

Super Scalar Data Dependence
• Consider
      R1 + R2 → R3
      R4 - R5 → R6
      load R7, Fred
      R7 * R1 → R2
• Super Scalar single issue:
      R1 + R2 → R3
      R4 - R5 → R6
      load R7, Fred
      WAIT
      R7 * R1 → R2
• Super Scalar double issue:
      R1 + R2 → R3   |   R4 - R5 → R6
      load R7, Fred
      WAIT
      R7 * R1 → R2
• Dynamic decision not to co-issue

VLIW Data Dependence
• Consider
      R1 + R2 → R3
      R4 - R5 → R6
      load R7, Fred
      R7 * R1 → R2
• VLIW double issue:
      R1 + R2 → R3   |   R4 - R5 → R6
      load R7, Fred  |   NOP
      WAIT
      R7 * R1 → R2
• Static decision not to co-issue
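The static decision can be illustrated with a small packing routine. This is only a sketch under loud assumptions: the `Op` record and `pack` function are hypothetical names, the issue width is 2, operations are kept in program order, and load latency (the WAIT) is ignored. It is not the TRACE compiler's algorithm.

```python
from typing import NamedTuple

class Op(NamedTuple):
    text: str      # printable form of the operation
    dest: str      # register written
    srcs: tuple    # registers read

def pack(ops, width=2):
    """Greedy in-order packing into fixed-width instruction words.

    An operation starts a new word if the current word is full or if it
    reads or rewrites a register written earlier in the same word.
    Unused slots are filled with NOP at "compile time".
    """
    words, current, written = [], [], set()
    for op in ops:
        depends = any(s in written for s in op.srcs) or op.dest in written
        if len(current) == width or depends:
            words.append(current + ["NOP"] * (width - len(current)))
            current, written = [], set()
        current.append(op.text)
        written.add(op.dest)
    if current:
        words.append(current + ["NOP"] * (width - len(current)))
    return words

program = [
    Op("R1 + R2 -> R3", "R3", ("R1", "R2")),
    Op("R4 - R5 -> R6", "R6", ("R4", "R5")),
    Op("load R7, Fred", "R7", ()),
    Op("R7 * R1 -> R2", "R2", ("R7", "R1")),
]
words = pack(program)
for word in words:
    print(" | ".join(word))
```

The multiply is not placed beside the load because it reads R7, so the compiler emits a NOP in that slot instead of relying on issue hardware to detect the hazard.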

Multiflow TRACE VLIW
• Multiflow TRACE uses very sophisticated compilation techniques to detect low-level parallelism.
• The idea is that low-level operations which can be executed at the same time are located and "packed" together into one instruction word.
• When this instruction is executed, all of the operations will fire.

Trace Machine Structure
• ALUs are controlled by one instruction stream.
• The TRACE machine splits the register file into integer and floating point files to meet the required bandwidth.

TRACE Performance

                             Multiflow TRACE           VAX          Cray
                             7/200       14/200        8700         XMP
Implementation Technology    CMOS        CMOS          ECL          ECL
Gate Speed                   3.5 nsecs   3.5 nsecs     1.5 nsecs    1.3 nsecs
Issue Rate                   130 nsecs   130 nsecs     45 nsecs     8 nsecs
Linpack MFLOPS               6.0         10.0          0.97         24.0
Whetstones                   12605       14400         3953         25700
Livermore Loop MFLOPS        2.3         3.4           0.9          12.3
ANSYS Benchmark M3 Secs      3757        2200          n/a          556

Note: this data is supplied by the manufacturer. There are many factors which affect performance, such as compiler options, vector length, etc.

TRACE Compiler
• The TRACE compiler must not only generate code for the VLIW machine, but also schedule the hardware resources statically at compile time.
  – This is not always possible!
• The compiler must schedule instructions so that as many ALUs are used as possible.

Memory scheduling
• The compiler must also schedule memory references so that there is no memory bank contention.
• This guarantees the arrival time of memory operands.
[Figure: Interleaved Memory, with consecutive addresses 1000 to 1007 spread across the banks]
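A minimal sketch of the bank-contention check, assuming low-order interleaving across 4 banks (the bank count and the names `bank_of` and `conflict_free` are illustrative assumptions, not taken from the TRACE documentation):

```python
NUM_BANKS = 4  # assumed number of interleaved banks

def bank_of(address, num_banks=NUM_BANKS):
    """Low-order interleaving: consecutive addresses fall in consecutive banks."""
    return address % num_banks

def conflict_free(addresses, num_banks=NUM_BANKS):
    """True if no two of the given references hit the same bank."""
    banks = [bank_of(a, num_banks) for a in addresses]
    return len(set(banks)) == len(banks)

print([bank_of(a) for a in range(1000, 1008)])
print(conflict_free([1000, 1001, 1002, 1003]))
print(conflict_free([1000, 1004]))
```

Under this assumption addresses 1000 and 1004 share a bank, so a compiler would not start both references together.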

Compiler optimizations
• Some optimizations are standard, and apply to many machines, but others are required for VLIW machines.
• TRACE can use previous program execution traces to determine the most common branch directions.
[Figure labels: Source, Compiler, Executable, Execute, Trace]

Compiler optimizations
• A branch which is taken the wrong way in a VLIW machine can be even more serious than in a scalar pipelined machine, because there are many instructions packed together.

Bad Branches on scalar machine
[Figure labels: Evaluate BTA, Condition Known, INCORRECT, 2 instructions lost]

Bad Branches on VLIW
[Figure labels: Evaluate BTA, Condition Known, INCORRECT, 4 instructions lost]

Compiler Optimisations...
• Each subroutine is considered one at a time.
• Classic optimisations are performed, such as:
  – loop invariant motion
  – common sub-expression elimination
• The compiler then builds a flow graph for the program so that data dependencies can be observed.

Conditional Branch Estimation
• The compiler performs static branch estimation for loops.
• The sense of the IF statement is unknown at compile time.
• However, with trace data it is possible to guess very accurately.
• It may even be possible to guess by using clever dataflow analysis and deduction.
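Guessing from trace data can be sketched as a majority vote per branch. The trace format (pairs of a branch identifier and a taken flag) and the function name `predict_from_trace` are assumptions made for illustration only.

```python
from collections import Counter

def predict_from_trace(trace):
    """Predict each branch as taken if it was taken in at least half of
    its recorded executions. `trace` is an iterable of (branch_id, taken)."""
    taken, total = Counter(), Counter()
    for branch_id, was_taken in trace:
        total[branch_id] += 1
        taken[branch_id] += int(was_taken)
    return {b: taken[b] * 2 >= total[b] for b in total}

# Hypothetical trace: a loop-closing branch and a rarely taken error check.
trace = [("loop_back", True)] * 9 + [("loop_back", False)] \
      + [("error_check", False)] * 5
prediction = predict_from_trace(trace)
print(prediction)
```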

Loop Unrolling
• One optimisation that is required for high performance on a VLIW machine is static loop unravelling.
• The bounds of a loop are reduced by some factor, and a number of loop iterations are statically unravelled.

Loop Unrolling - An example
• Original loop:
         DO 10 I = 1,10
            A(I) = 0
   10    CONTINUE
• Unrolled by a factor of 2:
         DO 10 I = 1,5
            A(I*2) = 0
            A(I*2-1) = 0
   10    CONTINUE
• Fully unrolled:
         A(1) = 0
         A(2) = 0
         A(3) = 0
         A(4) = 0
         A(5) = 0
         A(6) = 0
         A(7) = 0
         A(8) = 0
         A(9) = 0
         A(10) = 0
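The equivalence of the rolled and unrolled forms can be checked with a short sketch. It transliterates the Fortran into Python, keeping 1-based indexing by leaving element 0 unused; the function names are illustrative.

```python
def original(n=10):
    a = [None] * (n + 1)          # a[0] unused, mirrors Fortran A(1..n)
    for i in range(1, n + 1):
        a[i] = 0
    return a

def unrolled_by_two(n=10):
    assert n % 2 == 0, "this sketch assumes the trip count divides evenly"
    a = [None] * (n + 1)
    for i in range(1, n // 2 + 1):   # loop bound halved
        a[i * 2] = 0                 # two original iterations per pass
        a[i * 2 - 1] = 0
    return a

print(original() == unrolled_by_two())
```

When the trip count is not a multiple of the unroll factor, a compiler needs extra clean-up iterations, which this sketch does not model.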

Loop Unrolling - A harder example
         DO 10 I = 6,25
            A(I) = 0
            B(I) = A(I-4) + A(I-5)
   10    CONTINUE

Loop Unrolling - A harder example (unrolled by a factor of 5)
         DO 10 I = 0,3
            A(I*5 + 6) = 0
            B(I*5 + 6) = A(I*5 + 2) + A(I*5 + 1)
            A(I*5 + 7) = 0
            B(I*5 + 7) = A(I*5 + 3) + A(I*5 + 2)
            A(I*5 + 8) = 0
            B(I*5 + 8) = A(I*5 + 4) + A(I*5 + 3)
            A(I*5 + 9) = 0
            B(I*5 + 9) = A(I*5 + 5) + A(I*5 + 4)
            A(I*5 + 10) = 0
   10       B(I*5 + 10) = A(I*5 + 6) + A(I*5 + 5)
• Now, if all of these statements can be executed concurrently, then the loop only needs to be performed 4 times.
[Figure labels: Loop Unrolling, Loop Dependence]

Loop Unravelling
• Once the loop has been unrolled, it is possible to build the dependence graph for the loop body.
• This shows how to pack the instructions into the Very Long Instruction Word.
• In our example all of the statements could be executed together if there were sufficient resources.
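A rough sketch of building such a dependence graph for the unrolled body of the harder example follows. It is a simplified model, not the TRACE compiler's analysis: it looks only at read-after-write (flow) dependences inside one unrolled body, under sequential semantics, and the function names are made up for illustration.

```python
def unrolled_body():
    """Statements of one unrolled iteration as (label, writes, reads).
    Offsets are relative to I*5, so ("A", 6) stands for A(I*5 + 6)."""
    body = []
    for k in range(6, 11):
        body.append((f"A(I*5+{k}) = 0", {("A", k)}, set()))
        body.append((f"B(I*5+{k}) = A(I*5+{k - 4}) + A(I*5+{k - 5})",
                     {("B", k)}, {("A", k - 4), ("A", k - 5)}))
    return body

def flow_dependences(body):
    """Pairs (earlier, later) where `later` reads a location `earlier` wrote."""
    deps = []
    for later in range(len(body)):
        for earlier in range(later):
            if body[earlier][1] & body[later][2]:
                deps.append((body[earlier][0], body[later][0]));
    return deps

deps = flow_dependences(unrolled_body())
for earlier, later in deps:
    print(earlier, "->", later)
```

Under this model the only edge runs from `A(I*5+6) = 0` to the final statement, which reads A(I*5 + 6); this is possibly what the "Loop Dependence" label marks. Every other pair of statements in the body is independent and could share one instruction word given enough functional units.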

Conditional Instructions
• The Trace machine uses compare-predict operations rather than test operators.
• The results can be written to general registers, which can avoid some of the branches in complex IF chains.
      CEQ R1,R2, BB(R2)      Write BB with 1 if R1 == R2, else write BB with 0
      BRANCH (R3),LABEL      The branch_test field selects R3

Conditional Branches
• Branches cause problems when instructions are packed into wide instructions.
• Consider the following sequence:
      IF A < B GOTO 10
      IF C < D GOTO 20
• These two statements are independent, and can be packed together.

Conditional Branches...
• But what happens if both indicate a branch? Which one should be taken?
• The TRACE machine uses a statically encoded priority scheme, so that the first one has priority over the second.
      IF A < B GOTO 10   |   IF C < D GOTO 20
      (the first branch takes priority)
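The priority rule can be sketched as a first-match selection over the branches packed in one word. The function `next_target` and the "fall through" marker are hypothetical; the sketch says nothing about how the hardware encodes priorities.

```python
def next_target(branches, fall_through):
    """branches: list of (condition, target) in static priority order.
    The first branch whose condition holds wins; later ones are ignored."""
    for condition, target in branches:
        if condition:
            return target
    return fall_through

a, b, c, d = 1, 2, 3, 4            # both conditions hold
both = next_target([(a < b, 10), (c < d, 20)], "fall through")

a, b = 5, 2                        # only the second condition holds
second = next_target([(a < b, 10), (c < d, 20)], "fall through")
print(both, second)
```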

Compensation Code
• One problem with statically unrolled loops, and with packing instructions into one word together with a conditional branch, is that certain instructions may sometimes be executed even though a branch has been taken.
  – We have looked at some conventional solutions in pipelined machines.
• The TRACE machine inserts code to undo mistakes when they occur.

Compensation Code...
• Consider:
      IF A < B GOTO 10
      D = D + 1
• If these are packed together, then D will always be incremented.
• If A is usually >= B then this will usually be correct.

Compensation Code...
• But if A < B, the increment will still be done and will be wrong.
• The compiler could insert the following code at 10:
   10    D = D - 1
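A small sketch comparing the packed-plus-compensation version with the original sequential meaning. The function names are illustrative, and the model treats the packed word as "increment first, then branch".

```python
def reference(a, b, d):
    """Original meaning: D is incremented only if the branch is not taken."""
    if a < b:
        return "label 10", d
    return "fall through", d + 1

def packed_with_compensation(a, b, d):
    d = d + 1              # executed in the same word as the branch, unconditionally
    if a < b:              # branch taken: control reaches label 10
        d = d - 1          # compensation code inserted at label 10
        return "label 10", d
    return "fall through", d

cases = [(1, 2, 7), (2, 1, 7), (3, 3, 0)]
print(all(reference(*c) == packed_with_compensation(*c) for c in cases))
```

The compensation only costs time on the path the compiler predicted to be rare.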

TRACE Structure
• On the TRACE machine each functional unit is split into an integer ALU and a floating point ALU.
• Each FU requires 256 bits of instruction.
• A TRACE machine can have up to 4 functional units.
• A fully configured TRACE machine will require 4 × 256 = 1024 bits of instruction per cycle.

TRACE Structure...
• Each integer ALU contains 2 ALU/multipliers, an address translation TLB and a PC, as well as the integer registers.
• Each floating point unit contains a floating point adder, a floating point multiplier, and store and load registers.

TRACE Structure
[Figure: TRACE Structure]

Instruction Format
[Figure: Instruction Format]

Instruction Encoding
• In a highly parallel program each instruction will be packed with useful operations.
• In a program which does not have sufficient concurrency there will be many no-ops in the fields of the instructions.
• Also, even a highly parallel program may have regions which are low in concurrency.

Instruction Encoding...
• To combat the wasted space, a special memory format for the instructions is used.
• Instructions with no-op fields are expanded on the fly when they are loaded into the instruction cache.
[Figure labels: Encoded Instruction, Expanded Instruction]
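One way to picture such a format is a presence mask plus the non-NOP fields. This is a hypothetical encoding chosen for illustration; the slides do not describe the actual TRACE memory format, and `encode` and `expand` are made-up names.

```python
NOP = "NOP"

def encode(word):
    """Store only the non-NOP fields, plus a bit mask saying which slots they fill."""
    mask, ops = 0, []
    for slot, op in enumerate(word):
        if op != NOP:
            mask |= 1 << slot
            ops.append(op)
    return mask, ops

def expand(mask, ops, width):
    """Rebuild the full-width word, as would happen on an instruction cache fill."""
    remaining = iter(ops)
    return [next(remaining) if mask & (1 << slot) else NOP
            for slot in range(width)]

word = ["ADD", NOP, NOP, "LOAD"]
mask, ops = encode(word)
print(bin(mask), ops, expand(mask, ops, len(word)) == word)
```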

Memory Subsystem
• The TRACE machine uses an interleaved memory subsystem to achieve high throughput.
• It does not rely on large data caches and cache hit rates, but instead pipelines memory references (there is an instruction cache).

Memory Subsystem...
• There are multiple buses between the ALUs and the memory units: these are the F and I load buses and the F store buses.
• The load buses are bi-directional, and the store buses are uni-directional.
[Figure labels: Load Buses, Store Buses]

Memory Pipeline
• Memory is accessed using an 8-stage pipeline. This is visible to the compiler.
  0 The program says LD R1, R2, R3. R1 and R2 are added to form a virtual address. R2 may be replaced by a 6, 17 or 32 bit immediate constant.
  1 The virtual address is looked up in the TLB.
  2 The physical address is sent over the buses to the memory controller.
  3 The desired RAM bank starts cycling.
  4 RAM access continues.
  5 Data is returned from the memory controller.
  6 Data is sent over the buses.
  7 Data is written into the register file, and the CPU can use the data in R3.
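Because the pipeline is visible to the compiler, the stage a load occupies at any beat can be computed statically. The sketch below assumes exactly one stage per beat and no stalls; `stage_at` and `writeback_beat` are illustrative names.

```python
STAGES = [
    "form virtual address",                        # stage 0
    "TLB lookup",                                  # stage 1
    "send physical address to memory controller",  # stage 2
    "RAM bank starts cycling",                     # stage 3
    "RAM access continues",                        # stage 4
    "data returned from memory controller",        # stage 5
    "data sent over buses",                        # stage 6
    "write register file",                         # stage 7
]

def stage_at(issue_beat, beat):
    """Stage occupied at `beat` by a load issued at `issue_beat`, or None."""
    index = beat - issue_beat
    return STAGES[index] if 0 <= index < len(STAGES) else None

def writeback_beat(issue_beat):
    """Beat in which the loaded value is written to the register file."""
    return issue_beat + len(STAGES) - 1

print(writeback_beat(0), stage_at(3, 4))
```

Two loads issued one beat apart are always one stage apart, which is what lets the compiler check bank and bus usage ahead of time.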

Memory Pipeline...
[Figure: two overlapped memory references, each passing through the stages VA = R1 + R2, TLB Lookup, Adrs → Mem, Memory Cycle, Memory Cycle, Bus Busy, Data → Bus, Data → R3]
• Must ensure at compile time that the modules are different.
• Must ensure at compile time that the buses are different.

Memory system
• In a fully configured TRACE machine 4 memory references may be started in each beat, to 4 independently generated addresses.
• The following rules must be followed:
  – At most one reference may be initiated on any one controller.
  – No two references should be initiated which require the same bus to return the data.
  – No two references should be initiated to the same RAM bank within 4 beats of each other.
  – The available number of register file write ports should not be exceeded.
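The rules above can be sketched as a schedule checker. Several things here are assumptions rather than TRACE facts: the `Ref` record, bank identifiers being unique across controllers, "within 4 beats" read as a beat difference of less than 4, the bus rule applied to references started in the same beat, and a default of 4 write ports.

```python
from collections import namedtuple
from itertools import combinations

Ref = namedtuple("Ref", "beat controller bus bank")

def violations(refs, write_ports=4, bank_busy_beats=4):
    """Return a list of rule violations for a proposed set of references."""
    problems = []
    for x, y in combinations(refs, 2):
        same_beat = x.beat == y.beat
        if same_beat and x.controller == y.controller:
            problems.append(("controller", x, y))
        if same_beat and x.bus == y.bus:
            problems.append(("bus", x, y))
        if x.bank == y.bank and abs(x.beat - y.beat) < bank_busy_beats:
            problems.append(("bank", x, y))
    per_beat = {}
    for r in refs:
        per_beat[r.beat] = per_beat.get(r.beat, 0) + 1
    for beat, count in per_beat.items():
        if count > write_ports:
            problems.append(("write ports", beat, count))
    return problems

good = [Ref(0, 0, 0, 0), Ref(0, 1, 1, 1)]
bad = [Ref(0, 0, 0, 5), Ref(2, 1, 1, 5)]   # same bank, 2 beats apart
print(violations(good), [p[0] for p in violations(bad)])
```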

The Disambiguator
• A special module of the compiler, the disambiguator, determines whether memory references can be started in the same beat.
• It must determine whether
  – address1 mod #modules = address2 mod #modules
• The answer may be yes, no or maybe.
  – If the answer is no, then they are packed into the one instruction.
  – If the answer is yes or maybe, they are separated.

The Disambiguator
• Consider the following cases:
  – accessing a single variable (compiler controlled address)
  – accessing parts of an array
      A(I) and A(I+1)
      A(I) and A(I+J)
[Figure: Interleaved Memory holding A(I), A(I+1), A(I+2), A(I+3), A(I+4) in consecutive modules]
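A minimal sketch of the yes/no/maybe test for array references follows. It assumes one array element per memory word, consecutive words in consecutive modules, and subscripts written as a set of symbolic terms plus a constant; the tuple layout and the name `same_module` are illustrative, and a real disambiguator would reason about far more cases.

```python
def same_module(ref1, ref2, num_modules):
    """Each ref is (array name, {symbol: coefficient}, constant offset).
    Returns "yes", "no" or "maybe" for
    address1 mod #modules == address2 mod #modules."""
    base1, syms1, off1 = ref1
    base2, syms2, off2 = ref2
    if base1 != base2 or syms1 != syms2:
        return "maybe"       # difference is not a compile-time constant here
    return "yes" if (off1 - off2) % num_modules == 0 else "no"

a_i      = ("A", {"I": 1}, 0)            # A(I)
a_i_1    = ("A", {"I": 1}, 1)            # A(I+1)
a_i_4    = ("A", {"I": 1}, 4)            # A(I+4)
a_i_j    = ("A", {"I": 1, "J": 1}, 0)    # A(I+J)

for other in (a_i_1, a_i_4, a_i_j):
    print(same_module(a_i, other, num_modules=4))
```

With 4 modules assumed, A(I) and A(I+1) always differ, so they could be packed together; A(I) and A(I+J) depend on the run-time value of J, so they would be separated.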
