Spring 2003CSE P5481 VLIW Processors VLIW (“very long instruction word”) processors instructions are scheduled by the compiler a fixed number of operations.

Slides:



Advertisements
Similar presentations
Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW Advanced Computer Architecture COE 501.
Advertisements

ENGS 116 Lecture 101 ILP: Software Approaches Vincent H. Berk October 12 th Reading for today: , 4.1 Reading for Friday: 4.2 – 4.6 Homework #2:
CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of.
Instruction Level Parallelism María Jesús Garzarán University of Illinois at Urbana-Champaign.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture VLIW Steve Ko Computer Sciences and Engineering University at Buffalo.
POLITECNICO DI MILANO Parallelism in wonderland: are you ready to see how deep the rabbit hole goes? ILP: VLIW Architectures Marco D. Santambrogio:
CS252 Graduate Computer Architecture Spring 2014 Lecture 9: VLIW Architectures Krste Asanovic
1 Lecture 18: VLIW and EPIC Static superscalar, VLIW, EPIC and Itanium Processor (First introduce fast and high- bandwidth L1 cache design)
Rung-Bin Lin Chapter 4: Exploiting Instruction-Level Parallelism with Software Approaches4-1 Chapter 4 Exploiting Instruction-Level Parallelism with Software.
Dynamic Branch PredictionCS510 Computer ArchitecturesLecture Lecture 10 Dynamic Branch Prediction, Superscalar, VLIW, and Software Pipelining.
Pipelining 5. Two Approaches for Multiple Issue Superscalar –Issue a variable number of instructions per clock –Instructions are scheduled either statically.
1 Advanced Computer Architecture Limits to ILP Lecture 3.
Computer Organization and Architecture
Instruction-Level Parallelism (ILP)
DAP Spr.‘98 ©UCB 1 Lecture 6: ILP Techniques Contd. Laxmi N. Bhuyan CS 162 Spring 2003.
Limits on ILP. Achieving Parallelism Techniques – Scoreboarding / Tomasulo’s Algorithm – Pipelining – Speculation – Branch Prediction But how much more.
Computer Organization and Architecture
Instruction Level Parallelism (ILP) Colin Stevens.
Chapter 14 Superscalar Processors. What is Superscalar? “Common” instructions (arithmetic, load/store, conditional branch) can be executed independently.
Computer Organization and Architecture The CPU Structure.
1  2004 Morgan Kaufmann Publishers Chapter Six. 2  2004 Morgan Kaufmann Publishers Pipelining The laundry analogy.
Microprocessors Introduction to ia64 Architecture Jan 31st, 2002 General Principles.
EECS 470 Superscalar Architectures and the Pentium 4 Lecture 12.
COMP381 by M. Hamdi 1 Commercial Superscalar and VLIW Processors.
CH12 CPU Structure and Function
IA-64 ISA A Summary JinLin Yang Phil Varner Shuoqi Li.
The Arrival of the 64bit CPUs - Itanium1 นายชนินท์วงษ์ใหญ่รหัส นายสุนัยสุขเอนกรหัส
Is Out-Of-Order Out Of Date ? IA-64’s parallel architecture will improve processor performance William S. Worley Jr., HP Labs Jerry Huck, IA-64 Architecture.
Anshul Kumar, CSE IITD CS718 : VLIW - Software Driven ILP Example Architectures 6th Apr, 2006.
Hardware Support for Compiler Speculation
Introducing The IA-64 Architecture - Kalyan Gopavarapu - Kalyan Gopavarapu.
Chapter 8 CPU and Memory: Design, Implementation, and Enhancement The Architecture of Computer Hardware and Systems Software: An Information Technology.
StaticILP.1 2/12/02 Static ILP Static (Compiler Based) Scheduling Σημειώσεις UW-Madison Διαβάστε κεφ. 4 βιβλίο, και Paper on Itanium στην ιστοσελίδα.
CSCE 614 Fall Hardware-Based Speculation As more instruction-level parallelism is exploited, maintaining control dependences becomes an increasing.
Chapter 6 Pipelined CPU Design. Spring 2005 ELEC 5200/6200 From Patterson/Hennessey Slides Pipelined operation – laundry analogy Text Fig. 6.1.
Next Generation ISA Itanium / IA-64. Operating Environments IA-32 Protected Mode/Real Mode/Virtual Mode - if supported by the OS IA-64 Instruction Set.
Spring 2003CSE P5481 Advanced Caching Techniques Approaches to improving memory system performance eliminate memory operations decrease the number of misses.
Superscalar - summary Superscalar machines have multiple functional units (FUs) eg 2 x integer ALU, 1 x FPU, 1 x branch, 1 x load/store Requires complex.
1  1998 Morgan Kaufmann Publishers Chapter Six. 2  1998 Morgan Kaufmann Publishers Pipelining Improve perfomance by increasing instruction throughput.
Recap Multicycle Operations –MIPS Floating Point Putting It All Together: the MIPS R4000 Pipeline.
Lecture 1: Introduction Instruction Level Parallelism & Processor Architectures.
Unit II Intel IA-64 and Itanium Processor By N.R.Rejin Paul Lecturer/VIT/CSE CS2354 Advanced Computer Architecture.
IA-64 Architecture Muammer YÜZÜGÜLDÜ CMPE /12/2004.
Use of Pipelining to Achieve CPI < 1
CS 352H: Computer Systems Architecture
William Stallings Computer Organization and Architecture 8th Edition
CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue
Henk Corporaal TUEindhoven 2009
Morgan Kaufmann Publishers The Processor
The EPIC-VLIW Approach
Yingmin Li Ting Yan Qi Zhao
Lecture 23: Static Scheduling for High ILP
Chapter Six.
Advanced Computer Architecture
Henk Corporaal TUEindhoven 2011
Control unit extension for data hazards
Sampoorani, Sivakumar and Joshua
Instruction Level Parallelism (ILP)
CC423: Advanced Computer Architecture ILP: Part V – Multiple Issue
Control unit extension for data hazards
CSC3050 – Computer Architecture
How to improve (decrease) CPI
Control unit extension for data hazards
Presentation transcript:

Spring 2003CSE P5481 VLIW Processors VLIW (“very long instruction word”) processors instructions are scheduled by the compiler a fixed number of operations are formatted as one big instruction (called a bundle) usually LIW (3 operations) today change in the instruction set architecture, i.e., 1 program counter points to 1 bundle (not 1 operation) operations in a bundle issue in parallel fixed format so could decode operations in parallel enough FUs for types of operations that can issue in parallel pipelined FUs Machines: Multiflow & Cydra 5 (8 to 16 operations) in the 1980’s IA-64 (3 operations), Crusoe (4 operations), TM32 (5 operations) today

Spring 2003CSE P5482 VLIW Processors Goal of the hardware design: reduce hardware complexity to shorten the cycle time for better performance to reduce power requirements How VLIW designs reduce hardware complexity less multiple-issue hardware no dependence checking for instructions within a bundle can be fewer paths between instruction issue slots & FUs simpler instruction dispatch no out-of-order execution, no instruction grouping ideally no structural hazard checking logic Reduction in hardware complexity affects cycle time & power consumption

Spring 2003CSE P5483 VLIW Processors Compiler support to increase ILP compiler creates each VLIW word need for good code scheduling greater than with in-order issue superscalars instruction doesn’t issue if 1 operation can’t techniques for increasing ILP loop unrolling software pipelining (schedules instructions from different iterations together) aggressive inlining (function becomes part of the caller code) trace scheduling (schedule beyond basic block boundaries)

Spring 2003CSE P5484 VLIW Processors More compiler support to increase ILP detects hazards & hides latencies structural hazards no 2 operations to the same functional unit no 2 operations to the same memory bank hiding latencies data prefetching hoisting loads above stores data hazards no data hazards among instructions in a bundle control hazards predicated execution static branch prediction

Spring 2003CSE P5485 IA-64 EPIC Explicitly Parallel Instruction Computing, aka VLIW MHz Itanium IA-64 implementation Bundle of instructions 128 bit bundles 3 41-bit instructions/bundle 2 bundles can be issued at once if issue one, get another less delay in bundle issue than style slotting

Spring 2003CSE P5486 IA-64 EPIC Registers 128 integer & FP registers implications for architecture? 128 additional registers for loop unrolling & similar optimizations implications for hardware? 8 indirect branch registers miscellaneous other registers implications for performance? + -

Spring 2003CSE P5487 IA-64 EPIC 2 Full predicated execution supported by 64 one-bit predicate registers instruction that sets 2 at once (comparison result & complement) example cmp.eq r1, r2, p1, p2 (p1) sub 59, r10, r11 (p2) add r5, r6, r7

Spring 2003CSE P5488 IA-64 EPIC 2 Full predicated execution implications for architecture? implications for the hardware? implications for exploiting ILP?

Spring 2003CSE P5489 IA-64 EPIC Template in each bundle that indicates: type of operation for each instruction instruction order examples (2 of 24) M: load & manipulate the address (e.g., increment an index) I: integer ALU op F: FP op B: transfer of control other, e.g., stop (see below) restrictions on which instructions can be in which slots a stop bit that delineates the instructions that can execute in parallel all instructions before a stop have no data dependences

Spring 2003CSE P54810 IA-64 EPIC Template, cont’d. schedule for functional unit availability (I.e., template types) & latencies implications for hardware: no instruction grouping potentially fewer paths between issue slots & functional units potentially no structural hazard checks hardware not have to determine intra-bundle data dependences

Spring 2003CSE P54811 IA-64 EPIC Branch support full predicated execution branch prediction instruction PC of branch instruction branch prediction target forecasting new wrinkle on confidence hierarchy of branch prediction structures in different pipeline stages 4-target BTB for repeatedly executed taken branches an instruction puts a specific target in it (exposed to the architecture) 0-cycle execution if predict taken & correct larger back-up BTB 2-level branch prediction for hard-to-predict branches instruction hint that branches that are statically easy-to- predict should not be placed in it private history registers, 4 history bits, shared PHTs separate 2-level structure for multiway branches

Spring 2003CSE P54812 IA-64 EPIC Still seems complicated compatibility IA-32 PA-RISC compatible memory model (segments)

Spring 2003CSE P54813 IA-64 EPIC ISA & microarchitecture seem complicated (some features of out-of-order processors) not all instructions in a bundle need stall if one stalls (a scoreboard keeps track of produced source operands) multi-level branch prediction register remapping to support rotating registers on the “register stack” which aid in software pipelining & dynamically sized register windows special hardware for “register window” overflow detection; special instructions for saving & restoring the register stack speculative state cannot be stored to memory special instructions check integer register poison bits to detect whether value is speculative (speculative loads or exceptions) OS can override the ban (e.g., for a context switch) different mechanism for floating point (status registers) array address post-increment & loop control

Spring 2003CSE P54814 Trimedia TM32 Classic VLIW no hazard detection in hardware nops “guarantee” that dependences are followed instructions decompressed on fetching

Spring 2003CSE P54815 Superscalars vs. VLIW Superscalar has more complex hardware for instruction scheduling instruction slotting or out-of-order hardware more paths between instruction issue structure & functional units possible consequences: slower cycle times more chip real estate more power consumption but VLIW has more functional units if supports full predication possible consequences: slower cycle times more chip real estate more power consumption

Spring 2003CSE P54816 Superscalars vs. VLIW VLIW has larger code size estimates of IA-64 code of up to 2X - 4X over x86 128b holds 4 (not 3) instructions on a RISC superscalar sometimes nops if don’t have an instruction of the correct type branch targets must be at the beginning of a bundle predicated execution to avoid branches extra, special instructions check for exceptions & improper load hoisting allocate registers for local variables ala register windows branch prediction consequences: increase in instruction bandwidth requirements decrease in instruction cache effectiveness

Spring 2003CSE P54817 Superscalars vs. VLIW VLIW requires a more complex compiler Superscalars can more efficiently execute pipeline-independent code consequence: don’t have to recompile if change the implementation What else?