Anshul Kumar, CSE IITD CS718 : VLIW - Software Driven ILP Example Architectures 6th Apr, 2006.

Slide 2: Execution model - some issues
- Register access within an instruction
  - interaction between reads and writes to the same register within one instruction
- Operation completion under exception
  - which operations are completed when an exception occurs
- Exposing pipeline latencies
  - what latency information the compiler has

Slide 3: Register access in an instruction
Three possible rules:
- Read sees the original value of the register
  - allows a swap of two registers in a single instruction
- Read sees the value written by the write
  - a pair of operations that each read the register the other writes cannot be resolved
- Different operations that read and write the same register in an instruction are not allowed
  - parallel operations are not forced to execute in parallel
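The first and third rules above can be sketched in a few lines of Python. This is only an illustrative model, not any real machine's semantics: an "instruction" is assumed to be a list of register-to-register moves, and the function names are hypothetical.

```python
# Hypothetical model: one VLIW instruction is a list of (dest, src) register moves.

def execute_read_original(regs, moves):
    """Rule 1: every read sees the value held before the instruction started."""
    snapshot = dict(regs)            # read phase: capture all sources first
    result = dict(regs)
    for dest, src in moves:          # write phase: apply all writes afterwards
        result[dest] = snapshot[src]
    return result


def execute_disallow_conflicts(regs, moves):
    """Rule 3: reject an instruction in which one operation reads a register
    that a different operation writes."""
    for i, (dest, _) in enumerate(moves):
        for j, (_, src) in enumerate(moves):
            if i != j and dest == src:
                raise ValueError(
                    f"register {dest} is read and written by different operations"
                )
    # With no cross-operation conflicts, any execution order gives this result.
    return execute_read_original(regs, moves)


regs = {"r1": 10, "r2": 20}
swap = [("r1", "r2"), ("r2", "r1")]       # r1 <- r2 and r2 <- r1 in one instruction
swapped = execute_read_original(regs, swap)
```

Under the first rule the swap succeeds because both sources are captured before either write. Under the third rule the same instruction is rejected, which is what frees the hardware from having to run the two operations truly in parallel.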

Slide 4: Operation completion under exception
Completion policy | Consequence
None complete | Simplest
All that can complete, or all before the excepting operation, complete | Complex (must determine what remains to be fixed up)
Free-for-all | No guarantees

Slide 5: Exposing pipeline latencies
- EQ model
  - the destination is written in a cycle which is known at compile time (exactly the stated latency after issue)
- LEQ model
  - the destination may be written at any cycle up to the stated latency
  - more permissive, allows some binary compatibility
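One way to see the difference between the two models is to ask which values a read of the destination register might observe at a given cycle. The sketch below is a simplified model under stated assumptions: the function name and parameters are hypothetical, a read in the issue cycle is assumed to see the old value, and a read at `issue_cycle + latency` is assumed to see the new one.

```python
def possible_values(model, old, new, issue_cycle, latency, read_cycle):
    """Return the set of values a read of the destination register may observe.

    model is "EQ" (written exactly at issue_cycle + latency) or
    "LEQ" (written at some cycle no later than issue_cycle + latency).
    """
    if model not in ("EQ", "LEQ"):
        raise ValueError(f"unknown model: {model}")
    if read_cycle >= issue_cycle + latency:
        return {new}                 # both models guarantee the result by now
    if model == "EQ" or read_cycle <= issue_cycle:
        return {old}                 # EQ: old value stays visible until the last moment
    return {old, new}                # LEQ: the write may or may not have landed yet


# A 3-cycle operation issued at cycle 0, destination read at cycle 2.
eq_view = possible_values("EQ", old=1, new=2, issue_cycle=0, latency=3, read_cycle=2)
leq_view = possible_values("LEQ", old=1, new=2, issue_cycle=0, latency=3, read_cycle=2)
```

Under EQ the compiler may reuse the register during the latency window because the old value is guaranteed. Under LEQ it must not, but the same code stays correct on an implementation with a shorter latency, which is where the binary compatibility comes from.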

Slide 6: VLIW Examples
- IA-64 and Itanium: HP and Intel
- Trimedia: Philips
- Transmeta Crusoe
- DSPs: Texas Instruments, Analog Devices

Slide 7: IA-64 Register Model
Register set | Count | Width
General purpose registers | 128 | 64 bit
Floating point registers | 128 | 82 bit
Predicate registers | 64 | 1 bit
Branch registers (indirect branch) | 8 | 64 bit
Plus registers for system control, memory mapping, performance counters, and communication with the OS.

Slide 8: Register Stack
- GPRs 0-31 always available
- GPRs 32-127 used as a stack
- GPRs and FPRs support register rotation for SW pipelining
[Figure: overlapping register stack frames. Frame i-1 and frame i each have a LOCAL and an OUT region.]
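The overlap of frames in the register stack can be sketched as follows. This is a toy model under assumptions not taken from the slide: the class and method names are hypothetical, the first stacked register is assumed to be r32 (since GPRs 0-31 are always available), and overflow raises an error where real hardware would spill registers to memory.

```python
class RegisterStack:
    """Toy model of overlapping register frames (LOCAL and OUT regions)."""

    FIRST_STACKED = 32                   # assumption: r0-r31 are static, r32+ are stacked

    def __init__(self, physical_size=96):
        self.phys = [0] * physical_size
        self.base = 0                    # physical index that the current r32 maps to
        self.n_local = 0
        self.n_out = 0
        self.saved = []                  # caller frame descriptors

    def alloc(self, n_local, n_out):
        """Resize the current frame to n_local local and n_out output registers."""
        if self.base + n_local + n_out > len(self.phys):
            raise OverflowError("toy model does not spill to memory")
        self.n_local, self.n_out = n_local, n_out

    def _index(self, reg):
        offset = reg - self.FIRST_STACKED
        if not 0 <= offset < self.n_local + self.n_out:
            raise IndexError(f"r{reg} is outside the current frame")
        return self.base + offset

    def write(self, reg, value):
        self.phys[self._index(reg)] = value

    def read(self, reg):
        return self.phys[self._index(reg)]

    def call(self):
        """Slide the frame so the caller's OUT region becomes the callee's first registers."""
        self.saved.append((self.base, self.n_local, self.n_out))
        self.base += self.n_local
        self.n_local, self.n_out = self.n_out, 0

    def ret(self):
        self.base, self.n_local, self.n_out = self.saved.pop()


rs = RegisterStack()
rs.alloc(n_local=2, n_out=1)     # caller frame: r32-r33 local, r34 output
rs.write(34, 99)                 # caller places an argument in its OUT register
rs.call()
callee_sees = rs.read(32)        # the same physical register, now named r32
rs.ret()
```

No values are copied on the call: only the mapping from register names to physical registers moves, which is the point of the LOCAL/OUT overlap in the figure.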

Slide 9: IA-64 Execution Units
Execution unit | Instruction type | Description
I-unit | A | Arithmetic (integer)
I-unit | I | Non-ALU integer (shifts, tests, move)
M-unit | A | Arithmetic (integer)
M-unit | M | Memory (load/store)
F-unit | F | Floating point
B-unit | B | Branches, calls, loops
L+X | L+X | Extended immediates (executed by either B or I units)

Slide 10: Flexibility + explicit parallelism
- The compiler forms groups of instructions which can be executed in parallel if execution resources are available
- Instructions in a group may be scheduled in one or more cycles, depending upon resource availability
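The relationship between instruction groups and cycles can be illustrated with a small sketch. It assumes a simplified in-order issue rule (a new cycle starts as soon as the next instruction finds its unit type fully occupied); the function names, the `(unit, stop_after)` encoding, and the unit counts are hypothetical.

```python
from collections import Counter


def split_groups(instrs):
    """Split a stream of (unit_type, stop_after) pairs into instruction groups."""
    groups, current = [], []
    for unit, stop_after in instrs:
        current.append(unit)
        if stop_after:                   # a stop closes the current group
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups


def cycles_for_group(group, units):
    """Count the cycles needed to issue one group, in order, on the given units."""
    cycles, used = 1, Counter()
    for unit in group:
        if units.get(unit, 0) == 0:
            raise ValueError(f"no execution unit of type {unit}")
        if used[unit] == units[unit]:    # unit type exhausted: spill into a new cycle
            cycles += 1
            used = Counter()
        used[unit] += 1
    return cycles


stream = [("M", False), ("M", False), ("M", False), ("F", True), ("I", True)]
groups = split_groups(stream)
narrow = cycles_for_group(groups[0], {"M": 2, "F": 2, "I": 2})
wide = cycles_for_group(groups[0], {"M": 4, "F": 2, "I": 2})
```

The same group takes two cycles on the narrower machine and one on the wider machine. The compiler only guarantees independence within the group; how many cycles it occupies is left to the implementation.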

Slide 11: Instruction Formats
- Instructions are encoded in 128 bit bundles
- Each bundle = 5 bit template + 3 × 41 bit instructions
- The 5 bit template field specifies the execution unit types required for the 3 instructions and the position of stops, if any
- Stops indicate the boundaries of instruction groups
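The bundle arithmetic (5 + 3 × 41 = 128 bits) can be checked with a short packing sketch. The field order used here, with the template in the low-order bits followed by slots 0 to 2, is an assumption that the slide does not state, and the function names are hypothetical.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOTS_PER_BUNDLE = 3
SLOT_MASK = (1 << SLOT_BITS) - 1
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1


def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit instruction slots into one integer."""
    if not 0 <= template <= TEMPLATE_MASK:
        raise ValueError("template must fit in 5 bits")
    if len(slots) != SLOTS_PER_BUNDLE:
        raise ValueError("a bundle holds exactly 3 instructions")
    bundle = template
    for i, slot in enumerate(slots):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("each instruction must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Inverse of pack_bundle: return (template, [slot0, slot1, slot2])."""
    template = bundle & TEMPLATE_MASK
    slots = [
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK
        for i in range(SLOTS_PER_BUNDLE)
    ]
    return template, slots


bundle = pack_bundle(9, [SLOT_MASK, 0x123, 1])
decoded = unpack_bundle(bundle)
total_bits = TEMPLATE_BITS + SLOTS_PER_BUNDLE * SLOT_BITS
```

Because the template is shared by three instructions, the unit-type and stop information costs 5 bits per bundle, not a separate field in every instruction.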

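Slide 12: Template examples
Template | Slot 0 | Slot 1 | Slot 2
0 | M | I | I
1 | M | I | I
2 | M | I | I
3 | M | I | I
4 | M | L | X
5 | M | L | X
8 | M | M | I
9 | M | M | I
(Templates with identical slot types differ in the position of stops.)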

Slide 13: Example Schedule 1
Template | Slot 0 | Slot 1 | Slot 2 | Cycle
9: MMI | LD F0,0(R1) | LD F6,-8(R1) | - | 1
14: MMF | LD F10,-16(R1) | LD F14,-24(R1) | ADD F4,F0,F2 | 3
15: MMF | LD F18,-32(R1) | LD F22,-40(R1) | ADD F8,F6,F2 | 4
15: MMF | LD F26,-48(R1) | SD F4,0(R1) | ADD F12,F10,F2 | 6
15: MMF | SD F8,-8(R1) | SD F12,-16(R1) | ADD F16,F14,F2 | 9
15: MMF | SD F16,-24(R1) | - | ADD F20,F18,F2 | [missing]
[missing]: MMF | SD F20,-32(R1) | - | ADD F24,F22,F2 | [missing]
[missing]: MMF | SD F24,-40(R1) | - | ADD F28,F26,F2 | [missing]
[missing]: MFB | SD F28,-48(R1) | ADD R1,R1,-56 | BNE R1,R2,Loop | 21

Slide 14: Example Schedule 2
Template | Slot 0 | Slot 1 | Slot 2 | Cycle
8: MMI | LD F0,0(R1) | LD F6,-8(R1) | - | 1
9: MMI | LD F10,-16(R1) | LD F14,-24(R1) | - | 2
14: MMF | LD F18,-32(R1) | LD F22,-40(R1) | ADD F4,F0,F2 | 3
14: MMF | LD F26,-48(R1) | - | ADD F8,F6,F2 | 4
15: MMF | - | - | ADD F12,F10,F2 | 5
14: MMF | SD F4,0(R1) | - | ADD F16,F14,F2 | 6
14: MMF | SD F8,-8(R1) | - | ADD F20,F18,F2 | 7
15: MMF | - | SD F12,-16(R1) | ADD F24,F22,F2 | 8
14: MMF | - | SD F16,-24(R1) | ADD F28,F26,F2 | 9
9: MMI | SD F20,-32(R1) | SD F24,-40(R1) | - | 11
28: MFB | SD F28,-48(R1) | ADD R1,R1,-56 | BNE R1,R2,Loop | 12
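The trade-off between the two schedules can be made concrete with a little arithmetic. The sketch below reads the bundle counts (9 and 11) and final cycle numbers (21 and 12) from the two tables, and assumes 8-byte elements, so the R1 decrement of 56 corresponds to 7 elements per pass. The variable and function names are hypothetical.

```python
BUNDLE_BYTES = 128 // 8          # one bundle is 128 bits
ELEMENT_BYTES = 8                # assumption: double-precision elements (offsets step by 8)
ELEMENTS_PER_PASS = 56 // ELEMENT_BYTES   # R1 decreases by 56 per pass

schedules = {
    "Schedule 1": {"bundles": 9, "cycles": 21},
    "Schedule 2": {"bundles": 11, "cycles": 12},
}


def summarize(schedule):
    """Return code size in bytes and cycles spent per array element."""
    return {
        "code_bytes": schedule["bundles"] * BUNDLE_BYTES,
        "cycles_per_element": schedule["cycles"] / ELEMENTS_PER_PASS,
    }


summary = {name: summarize(s) for name, s in schedules.items()}
for name, stats in summary.items():
    print(f"{name}: {stats['code_bytes']} bytes, "
          f"{stats['cycles_per_element']:.2f} cycles per element")
```

Schedule 1 is the more compact encoding, while Schedule 2 spends two extra bundles to finish each pass in far fewer cycles.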

Slide 15: Predication Support
- Almost all instructions are predicated
- A 6 bit field specifies the predicate register
- Predicate registers are set by test instructions
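A small interpreter shows how predication turns a branch into straight-line code. The sketch is illustrative only: the tuple encoding, the operation names, and the convention that predicate 0 is always true are assumptions made for the example. The 64-entry predicate file matches the 6 bit field mentioned above.

```python
def new_predicates():
    """64 one-bit predicate registers; predicate 0 is assumed to be always true."""
    preds = [False] * 64
    preds[0] = True
    return preds


def run_predicated(program, regs, preds):
    """Execute (qualifying_predicate, op, args) tuples; false predicates nullify."""
    for qp, op, args in program:
        if not preds[qp]:
            continue                         # instruction is skipped, no branch taken
        if op == "cmp_eq":                   # test instruction sets a predicate pair
            p_true, p_false, a, b = args
            preds[p_true] = regs[a] == regs[b]
            preds[p_false] = not preds[p_true]
        elif op == "add":
            dest, a, b = args
            regs[dest] = regs[a] + regs[b]
        elif op == "mov_imm":
            dest, value = args
            regs[dest] = value
        else:
            raise ValueError(f"unknown operation: {op}")
    return regs


# if (r1 == r2) r3 = r3 + r4; else r3 = 7;   after if-conversion:
IF_CONVERTED = [
    (0, "cmp_eq", (1, 2, "r1", "r2")),       # p1 = (r1 == r2), p2 = !p1
    (1, "add", ("r3", "r3", "r4")),          # executes only when p1 is true
    (2, "mov_imm", ("r3", 7)),               # executes only when p2 is true
]

then_path = run_predicated(IF_CONVERTED, {"r1": 5, "r2": 5, "r3": 1, "r4": 2}, new_predicates())
else_path = run_predicated(IF_CONVERTED, {"r1": 5, "r2": 6, "r3": 1, "r4": 2}, new_predicates())
```

Both arms of the conditional are fetched on every run, and the predicate pair decides which one takes effect, so no branch needs to be predicted.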

Slide 16: Speculation Support
- Control speculation using the poison bit approach
  - one additional bit in GPRs: NaT (not a thing)
  - NaTVal in FPRs
  - registers with NaT or NaTVal can't be stored
  - special instructions to save and restore registers with poison bits/values
- Load/store speculation using an advanced load instruction and the ALAT, a table with associative lookup
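The poison bit idea can be sketched with a sentinel value standing in for NaT. This is a toy model: the function names are hypothetical, a missing dictionary key stands in for a faulting address, and the recovery action is simply a callback.

```python
NAT = object()      # hypothetical sentinel standing in for a set NaT bit


def speculative_load(memory, addr):
    """Load hoisted above a branch: a fault is deferred by returning NaT."""
    return memory.get(addr, NAT)


def add(a, b):
    """Poison propagates: any NaT operand makes the result NaT."""
    return NAT if a is NAT or b is NAT else a + b


def check(value, recovery):
    """Executed where the load originally stood: run recovery only if poisoned."""
    return recovery() if value is NAT else value


memory = {100: 5}

good = add(speculative_load(memory, 100), 1)      # valid address: normal result
bad = add(speculative_load(memory, 200), 1)       # faulting address: poison, no exception yet

recovered = check(bad, recovery=lambda: -1)       # recovery code runs only on this path
unchanged = check(good, recovery=lambda: -1)
```

If the branch guarding the original load goes the other way, the check is never executed and the deferred fault is silently dropped, which is what makes it safe to hoist the load.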

Slide 17: Itanium Processor
- Introduced in 2001 with an 800 MHz clock
- 3-level cache: first level split, first two levels on-chip
- 2 I units, 2 M units, 3 B units, 2 F units
- 10 stage pipeline
- Pre-fetch buffer with 8 bundles: 2 bundles pre-fetched per cycle
- Up to 2 bundles issued at a time: up to 6 instructions distributed to 9 execution units, with register renaming (rotation and stacking)
- Good FP performance, but weaker integer performance

Slide 18: Trimedia TM32
- Designed for embedded applications
- Classic VLIW architecture, completely static scheduling
- 5 operation slots per instruction
  - each specifies an operation or immediate field
- No hazard detection hardware
- Compressed code stored in memory and cache, decompressed during fetch
- Each operation can be individually predicated
- In an instruction with multiple branches, at most one predicate can be true
- No virtual memory

Slide 19: Trimedia Function Units
- 23 function units of 11 different types
  - min latency 0 (integer ALU)
  - max latency 16 (FP divide and square root)
- A function unit can be specified by only certain instruction slots
  - ALU (all), DMem (4, 5), Branch (2, 3, 4), DSPALU (1, 3), FALU (1, 4), FTough (2)
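The slot restrictions above turn instruction packing into a small assignment problem. The sketch below uses the slot sets from the slide and a simple backtracking search. The function name is hypothetical, and the model ignores how many physical units of each type exist, which a real scheduler would also have to respect.

```python
# Slots (numbered 1-5) from which each function unit type can be specified.
SLOTS = {
    "ALU": {1, 2, 3, 4, 5},
    "DMem": {4, 5},
    "Branch": {2, 3, 4},
    "DSPALU": {1, 3},
    "FALU": {1, 4},
    "FTough": {2},
}


def assign_slots(ops, used=frozenset()):
    """Return one slot per operation (same order as ops), or None if impossible."""
    if not ops:
        return []
    first, rest = ops[0], ops[1:]
    for slot in sorted(SLOTS[first] - used):      # try each free, legal slot
        tail = assign_slots(rest, used | {slot})
        if tail is not None:
            return [slot] + tail
    return None                                   # no legal slot left: backtrack


packed = assign_slots(["DMem", "DMem", "Branch", "FALU", "ALU"])
too_many_loads = assign_slots(["DMem", "DMem", "DMem"])
```

The first request fits into one instruction, while three memory operations cannot, because only slots 4 and 5 can specify DMem. With completely static scheduling, it is the compiler that has to solve this problem for every instruction.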

Slide 20: Transmeta Crusoe
- Designed for low power applications like mobile PCs and mobile internet appliances
- Compatibility with x86 through translating software
- 500 MHz to 1 GHz, 5 to 7 W power consumption
- 64 bit (2 operations) and 128 bit (4 operations) versions, 64 integer registers [new 256 bit Efficeon]
- Operation slot types: ALU, compute (int/fp/mm), Memory, Branch, Immediate
- Support for speculative re-ordering: shadow register file, program-controlled store buffer, memory alias detection, conditional move