Anshul Kumar, CSE IITD CS718 : VLIW - Software Driven ILP Example Architectures 6th Apr, 2006.


Slide 1: CS718: VLIW - Software Driven ILP, Example Architectures (Anshul Kumar, CSE IITD, 6th Apr, 2006)

Slide 2: Execution model - some issues
Register access within an instruction
  – interaction between reads and writes to the same register within an instruction
Operation completion under exception
  – which operations are completed when an exception occurs
Exposing pipeline latencies
  – what latency information the compiler has

Slide 3: Register access in an instruction
Option 1: a read sees the original value of the register
  – allows a swap of two registers in a single instruction
Option 2: a read sees the value written by the write
  – a pair of operations that each read the register the other writes cannot be resolved
Option 3: different operations that read and write the same register in one instruction are not allowed
  – parallel operations are then not forced to execute in parallel
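The difference between the first two options can be illustrated with a small simulation. This is a minimal sketch, not a model of any real machine: the function names and the register dictionary are hypothetical, and the left-to-right ordering used in the second function is an assumption made only to show what goes wrong when reads observe writes from the same instruction.

```python
def execute_read_original(regs, ops):
    """All reads see the register values from before the instruction."""
    snapshot = dict(regs)  # values as they were before the instruction
    updated = dict(regs)
    for dest, fn, srcs in ops:
        updated[dest] = fn(*(snapshot[s] for s in srcs))
    return updated


def execute_read_written(regs, ops):
    """Reads see values already written by earlier operations
    (assumed left-to-right order, for illustration only)."""
    updated = dict(regs)
    for dest, fn, srcs in ops:
        updated[dest] = fn(*(updated[s] for s in srcs))
    return updated


def move(x):
    return x


# One instruction holding two operations: r1 <- r2 and r2 <- r1.
swap = [("r1", move, ["r2"]), ("r2", move, ["r1"])]
regs = {"r1": 10, "r2": 20}

swapped = execute_read_original(regs, swap)  # the values are exchanged
clobbered = execute_read_written(regs, swap)  # both registers end up equal
```

With the first function the two moves exchange the register contents; with the second, the original value of r1 is lost before it can be copied.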

Slide 4: Operation completion under exception
None complete: simplest
All that can complete, or all before the excepting operation, complete: complex (must determine what remains to be fixed up)
Free-for-all: no guarantees

Slide 5: Exposing pipeline latencies
EQ model
  – the destination is written in a cycle which is known at compile time
LEQ model
  – more permissive: the destination may be written at any cycle up to the one known at compile time
  – allows some binary compatibility

Slide 6: VLIW Examples
IA-64 and Itanium: HP and Intel
Trimedia: Philips
Transmeta Crusoe
DSPs: Texas Instruments, Analog Devices

Slide 7: IA-64 Register Model
128 general purpose registers             64 bit
128 floating point registers              82 bit
64 predicate registers                    1 bit
8 branch registers (indirect branch)      64 bit
Registers for system control, memory mapping, performance counters, communication with OS

Slide 8: Register Stack
GPRs 0-31 always available
GPRs 32-127 used as a stack
GPRs and FPRs support register rotation for SW pipelining
[Figure: register stack frames i-1 and i, each with LOCAL and OUT regions]

Slide 9: IA-64 Execution Units
Execution unit   Instruction type   Description
I-unit           A                  Arithmetic (integer)
                 I                  Non-ALU integer (shifts, tests, moves)
M-unit           A                  Arithmetic (integer)
                 M                  Memory (load/store)
F-unit           F                  Floating point
B-unit           B                  Branches, calls, loops
L+X              L+X                Extended immediates (executed by either B or I units)

Slide 10: Flexibility + explicit parallelism
The compiler forms groups of instructions which can be executed in parallel if execution resources are available
Instructions in a group may be scheduled in one or more cycles, depending upon resource availability

Slide 11: Instruction Formats
Instructions are encoded in 128-bit bundles
Each bundle = 5-bit template + 3 × 41-bit instructions
The 5-bit template field specifies the execution unit types required for the 3 instructions and the position of stops, if any
Stops indicate the boundaries of instruction groups
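The bundle arithmetic (5 + 3 × 41 = 128 bits) can be sketched as a pack/unpack pair. This is a hedged illustration: the slide gives only the field widths, so placing the template in the low-order bits and slot 0 immediately above it is an assumption of this sketch, and the function names are hypothetical.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOTS_PER_BUNDLE = 3
BUNDLE_BITS = TEMPLATE_BITS + SLOTS_PER_BUNDLE * SLOT_BITS  # 128


def pack_bundle(template, slots):
    """Combine a template and three instruction slots into one integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template does not fit in 5 bits")
    if len(slots) != SLOTS_PER_BUNDLE:
        raise ValueError("a bundle holds exactly 3 instructions")
    bundle = template  # assumed position: low-order bits
    for i, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("instruction does not fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, [slot0, slot1, slot2])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slot_mask = (1 << SLOT_BITS) - 1
    slots = [
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & slot_mask
        for i in range(SLOTS_PER_BUNDLE)
    ]
    return template, slots
```

For example, `unpack_bundle(pack_bundle(9, [1, 2, 3]))` returns the same template and slot values that were packed.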

Slide 12: Template examples
Template   Slot 0   Slot 1   Slot 2
0          M        I        I
1          M        I        I
2          M        I        I
3          M        I        I
4          M        L        X
5          M        L        X
8          M        M        I
9          M        M        I

Slide 13: Example Schedule 1
Template   Slot 0            Slot 1            Slot 2            Cycle
9: MMI     LD F0,0(R1)       LD F6,-8(R1)      -                 1
14: MMF    LD F10,-16(R1)    LD F14,-24(R1)    ADD F4,F0,F2      3
15: MMF    LD F18,-32(R1)    LD F22,-40(R1)    ADD F8,F6,F2      4
15: MMF    LD F26,-48(R1)    SD F4,0(R1)       ADD F12,F10,F2    6
15: MMF    SD F8,-8(R1)      SD F12,-16(R1)    ADD F16,F14,F2    9
15: MMF    SD F16,-24(R1)    -                 ADD F20,F18,F2    12
15: MMF    SD F20,-32(R1)    -                 ADD F24,F22,F2    15
15: MMF    SD F24,-40(R1)    -                 ADD F28,F26,F2    18
28: MFB    SD F28,-48(R1)    ADD R1,R1,-56     BNE R1,R2,Loop    21

Slide 14: Example Schedule 2
Template   Slot 0            Slot 1            Slot 2            Cycle
8: MMI     LD F0,0(R1)       LD F6,-8(R1)      -                 1
9: MMI     LD F10,-16(R1)    LD F14,-24(R1)    -                 2
14: MMF    LD F18,-32(R1)    LD F22,-40(R1)    ADD F4,F0,F2      3
14: MMF    LD F26,-48(R1)    -                 ADD F8,F6,F2      4
15: MMF    -                 -                 ADD F12,F10,F2    5
14: MMF    SD F4,0(R1)       -                 ADD F16,F14,F2    6
14: MMF    SD F8,-8(R1)      -                 ADD F20,F18,F2    7
15: MMF    -                 SD F12,-16(R1)    ADD F24,F22,F2    8
14: MMF    -                 SD F16,-24(R1)    ADD F28,F26,F2    9
9: MMI     SD F20,-32(R1)    SD F24,-40(R1)    -                 11
28: MFB    SD F28,-48(R1)    ADD R1,R1,-56     BNE R1,R2,Loop    12

Slide 15: Predication Support
Almost all instructions are predicated
A 6-bit field specifies the predicate register
Predicate registers are set by test instructions
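Predicated execution can be sketched with a tiny interpreter in which every operation names a qualifying predicate and is squashed when that predicate is false. This is an illustrative sketch only: the operation names (`cmp_lt`, `movi`), the tuple encoding, treating predicate 0 as always true, and having the compare write a predicate together with its complement are all assumptions of the example, not details taken from the slides.

```python
NUM_PREDICATES = 64  # a 6-bit field can name 64 predicate registers


def run(program, regs):
    """Execute (qp, op, *args) tuples; an op takes effect only if preds[qp] is true."""
    preds = [False] * NUM_PREDICATES
    preds[0] = True  # assumption: predicate 0 is hardwired to true
    for qp, op, *args in program:
        if not preds[qp]:
            continue  # squashed: no architectural effect
        if op == "cmp_lt":
            p_true, p_false, a, b = args
            result = regs[a] < regs[b]
            preds[p_true] = result
            preds[p_false] = not result
        elif op == "movi":
            dest, value = args
            regs[dest] = value
        else:
            raise ValueError(f"unknown operation: {op}")
    return regs


def select(a, b):
    """If-converted form of: x = 1 if a < b else 2 (no branch needed)."""
    program = [
        (0, "cmp_lt", 1, 2, "a", "b"),  # test instruction sets p1 and p2
        (1, "movi", "x", 1),            # executes only when p1 is true
        (2, "movi", "x", 2),            # executes only when p2 is true
    ]
    return run(program, {"a": a, "b": b, "x": 0})["x"]
```

Both `movi` operations are issued every time; the predicates decide which one takes effect, which is how a compiler can remove the branch.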

Slide 16: Speculation Support
Control speculation using the poison bit approach
  – one additional bit in GPRs: NaT (not a thing)
  – NaTVal in FPRs
  – registers with NaT or NaTVal can't be stored
  – special instructions to save and restore registers with poison bits/values
Load/store speculation using an advanced load instruction and the ALAT table with associative look-up
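The poison bit idea can be sketched as follows. This is a minimal model under stated assumptions: the `Reg` class, the function names, and the use of a dictionary as memory are hypothetical, and a missing address stands in for any faulting load.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reg:
    value: int = 0
    nat: bool = False  # poison bit ("not a thing")


def speculative_load(memory, addr):
    """A speculative load defers a fault by returning a poisoned register."""
    if addr in memory:
        return Reg(memory[addr])
    return Reg(0, nat=True)


def add(a, b):
    """Poison propagates to the result of any dependent operation."""
    if a.nat or b.nat:
        return Reg(0, nat=True)
    return Reg(a.value + b.value)


def store(memory, addr, reg):
    """An ordinary store of a poisoned register is rejected."""
    if reg.nat:
        raise RuntimeError("attempt to store a register with NaT set")
    memory[addr] = reg.value
```

A load hoisted above a branch can therefore fault silently; the problem surfaces only if the poisoned value is actually used in a non-speculative way, such as a store.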

Slide 17: Itanium Processor
Introduced in 2001 with an 800 MHz clock
3-level cache: first level split, first 2 levels on-chip
2 I units, 2 M units, 3 B units, 2 F units
10-stage pipeline
Pre-fetch buffer with 8 bundles: 2 bundles pre-fetched per cycle
Up to 2 bundles issued at a time: up to 6 instructions distributed to 9 execution units, with register renaming (rotation and stacking)
Good FP performance, but not integer performance

Slide 18: Trimedia TM32
Designed for embedded applications
Classic VLIW architecture, completely static scheduling
5 operation slots per instruction
  – each specifies an operation or immediate field
No hazard detection hardware
Compressed code stored in memory and cache, decompressed during fetch
Each operation can be individually predicated
In an instruction with multiple branches, at most one predicate can be true
No virtual memory

Slide 19: Trimedia Function Units
23 function units of 11 different types
  – min latency 0 (integer ALU)
  – max latency 16 (FP divide and square root)
Each function unit type can be specified only by certain instruction slots
  – ALU (all), DMem (4, 5), Branch (2, 3, 4), DSPALU (1, 3), FALU (1, 4), FTough (2)
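The slot restrictions listed above amount to a small assignment problem that a static scheduler has to solve for every instruction. The sketch below uses only the six unit types and slot numbers named on the slide, reads "ALU (all)" as slots 1 to 5, and uses a simple backtracking search; the function name and the search strategy are assumptions of this example, not the actual Trimedia compiler algorithm.

```python
ALLOWED_SLOTS = {
    "ALU": {1, 2, 3, 4, 5},  # "all" of the 5 operation slots
    "DMem": {4, 5},
    "Branch": {2, 3, 4},
    "DSPALU": {1, 3},
    "FALU": {1, 4},
    "FTough": {2},
}


def assign_slots(units):
    """Return one slot per requested unit (in order), or None if the
    operations cannot all fit into a single 5-slot instruction."""
    def place(i, used):
        if i == len(units):
            return []
        for slot in sorted(ALLOWED_SLOTS[units[i]] - used):
            rest = place(i + 1, used | {slot})
            if rest is not None:
                return [slot] + rest
        return None  # no free slot works for this operation

    return place(0, frozenset())
```

For instance, `assign_slots(["DMem", "DMem", "DMem"])` returns `None`, because only slots 4 and 5 can issue data memory operations, while two memory operations plus a branch do fit.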

Slide 20: Transmeta Crusoe
Designed for low power applications like mobile PCs and mobile internet appliances
Compatibility with x86 through translating software
500 MHz to 1 GHz, 5 to 7 W power consumption
64-bit (2 operations) and 128-bit (4 operations) versions, 64 integer registers [new 256-bit Efficeon]
Operation slot types: ALU, compute (int/fp/mm), Memory, Branch, Immediate
Support for speculative re-ordering: shadow register file, program-controlled store buffer, memory alias detection, conditional move

