Lecture 18: VLIW and EPIC. Static superscalar, VLIW, EPIC and Itanium Processor (First introduce fast and high-bandwidth L1 cache design)

Presentation transcript:


2 Fast Cache Hit by Virtual Cache
Physical cache: physically indexed and physically tagged.
Virtual cache: virtually indexed and virtually tagged.
  Must flush the cache at a process switch, or add a PID to each tag.
  Must handle virtual address aliases that map to the same physical address.
[Figure: block diagrams of a physical cache (the TLB translates the V-addr into the P-addr, which supplies the P-index and P-tag) and a virtual cache (tag compare on the V-tag plus PID; the TLB produces the P-addr sent to L2).]

3 Fast Cache Hits by Avoiding Translation: Process ID Impact
[Figure: miss rate (y axis, up to 20%) versus cache size (x axis, 2 KB to 1024 KB). Black: uniprocess. Light gray: multiprocess, flushing the cache at each switch. Dark gray: multiprocess, using a process ID tag.]

4 Virtually Indexed, Physically Tagged Cache
What is the motivation?
  Fast cache hit through parallel TLB access.
  None of the virtual cache shortcomings.
How can it be correct?
  Require cache way size <= page size, so the physical index comes entirely from the page offset.
  The virtual and physical indices are then identical => it works like a physically indexed cache!
[Figure: the V-addr supplies the V-index to the cache while the TLB supplies the P-tag for the tag compare (=?).]
What if we want bigger caches?
  Higher associativity moves the barrier to the right.
  Page coloring.
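The way-size rule on this slide can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the 4 KB page size in the usage example is an assumption, not something the slide states.

```python
def vipt_is_alias_free(cache_bytes: int, ways: int, page_bytes: int) -> bool:
    """True if one cache way fits in a page, so all index bits
    come from the page offset (identical in virtual and physical addresses)."""
    way_bytes = cache_bytes // ways
    return way_bytes <= page_bytes


def min_ways_for_vipt(cache_bytes: int, page_bytes: int) -> int:
    """Smallest associativity that keeps the way size <= page size."""
    return max(1, -(-cache_bytes // page_bytes))  # ceiling division


if __name__ == "__main__":
    PAGE = 4 * 1024  # assumed page size
    print(vipt_is_alias_free(8 * 1024, 1, PAGE))    # direct-mapped 8 KB: too big per way
    print(vipt_is_alias_free(32 * 1024, 8, PAGE))   # 8-way 32 KB: 4 KB per way
    print(min_ways_for_vipt(64 * 1024, PAGE))       # ways needed for a 64 KB cache
```

This is why "higher associativity moves the barrier": doubling the number of ways doubles the cache size that still satisfies the constraint. Page coloring is the alternative when the associativity cannot grow.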

5 Pipelined Cache Access
For multi-issue processors, cache bandwidth affects the effective cache hit time.
  Queueing delay adds up if the cache does not have enough read/write ports.
Pipelined cache accesses reduce the cache cycle time and improve bandwidth.
Cache organizations for high bandwidth:
  Duplicated cache
  Banked cache
  Double-clocked cache
Key technology: wave pipelining, which does not need latches.

6 Pipelined Cache Access: Alpha 21264 Data Cache Design
The cache is 64 KB, 2-way associative; it cannot be accessed within one cycle.
One cycle is used for address transfer and data transfer, pipelined with the data array access.
The cache is clocked at twice the processor frequency; it is wave pipelined to achieve that speed.

7 Trace Cache
Trace: a dynamic sequence of instructions, including taken branches.
Traces are constructed dynamically by the processor hardware, and frequently used traces are stored in the trace cache.
Example: the Intel P4 processor, which stores about 12K micro-ops.
(End of the cache discussion)

8 Two Paths to High ILP
Modern superscalar processors: dynamic scheduling, speculative execution, branch prediction, dynamic memory disambiguation, non-blocking caches => more and more hardware functionality AND complexity.
Another direction: let the compiler take on the complexity.
  Simple hardware, smart compiler.
  Static superscalar, VLIW, EPIC.

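9 MIPS Pipeline with Pipelined Multi-Cycle Operations
[Figure (page A-50, Figure A.31): IF and ID feed four execution paths: the integer EX stage, a 7-stage pipelined multiplier (M1-M7), a 4-stage pipelined adder (A1-A4), and an unpipelined divider (DIV), followed by MEM and WB.]
Pipelined implementation, ex: 7 outstanding MULs, 4 outstanding ADDs, unpipelined DIV.
This pipeline: in-order execution, out-of-order completion.
Tomasulo with ROB: out-of-order execution, out-of-order completion, in-order commit.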

10 More Hazard Detection and Forwarding
Assume hazards are checked at ID (the simple way).
Structural hazards
  Hazards at WB: track register usage at ID and stall the instruction if a hazard is detected.
  Separate int and fp registers to reduce hazards.
RAW hazards: check the source registers against all EX stages except the last one.
  A dependent instruction must wait for the producing instruction to reach the last stage of EX.
  Ex: check with ID/A1, A1/A2, A2/A3, but not A4/MEM.
WAW hazards
  Instructions reach WB out of order.
  Check all multi-cycle stages (A1-A4, D, M1-M7) for the same dest register.
Out-of-order completion complicates the maintenance of precise exceptions.
More forwarding data paths are needed.
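The RAW and WAW checks listed on this slide amount to comparing register names against what is held in the pipeline. The following is a minimal sketch under stated assumptions: pipeline state is modeled as a plain dictionary from latch or stage name to destination register, the RAW check covers only the FP adder path from the slide's example, and the function names are hypothetical.

```python
# Latches checked for RAW hazards on the FP adder path.
# A4/MEM is left out: a result there can be forwarded to the consumer.
ADDER_RAW_LATCHES = ("ID/A1", "A1/A2", "A2/A3")

# Every multi-cycle stage is checked for WAW hazards.
MULTI_CYCLE_STAGES = (
    "A1", "A2", "A3", "A4", "D",
    "M1", "M2", "M3", "M4", "M5", "M6", "M7",
)


def raw_stall(srcs: set, latch_dest: dict) -> bool:
    """Stall in ID if a source register is still being produced in the adder."""
    return any(latch_dest.get(latch) in srcs for latch in ADDER_RAW_LATCHES)


def waw_stall(dest, stage_dest: dict) -> bool:
    """Stall in ID if an in-flight multi-cycle op writes the same register."""
    if dest is None:
        return False
    return any(stage_dest.get(stage) == dest for stage in MULTI_CYCLE_STAGES)


if __name__ == "__main__":
    # ADD.D F4,F0,F2 in ID while the producer of F0 sits in A1/A2: must wait.
    print(raw_stall({"F0", "F2"}, {"A1/A2": "F0"}))
    # Producer already at A4/MEM: no stall, the value is forwarded.
    print(raw_stall({"F0", "F2"}, {"A4/MEM": "F0"}))
    # A multiply in M3 will also write F4: WAW stall.
    print(waw_stall("F4", {"M3": "F4"}))
```

Real hardware does these comparisons in parallel in the ID stage; the loop here only spells out which latches take part.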

11 Compiler Optimization Example
Add a scalar to a vector:
    for (i=1000; i>0; i=i-1)
        x[i] = x[i] + s;
MIPS code:
    Loop: L.D    F0,0(R1)      ;F0=vector element
                               ;stall for L.D, assume 1 cycle
          ADD.D  F4,F0,F2      ;add scalar from F2
                               ;stall for ADD.D, assume 2 cycles
          S.D    0(R1),F4      ;store result
          DSUBUI R1,R1,#8      ;decrement pointer
          BNEZ   R1,Loop       ;branch R1!=zero
                               ;stall for taken branch, assume 1 cycle

12 Loop Unrolling
    Loop: L.D    F0,0(R1)
          ADD.D  F4,F0,F2
          S.D    0(R1),F4      ;drop DSUBUI & BNEZ
          L.D    F6,-8(R1)
          ADD.D  F8,F6,F2
          S.D    -8(R1),F8     ;drop DSUBUI & BNEZ
          L.D    F10,-16(R1)
          ADD.D  F12,F10,F2
          S.D    -16(R1),F12   ;drop DSUBUI & BNEZ
          L.D    F14,-24(R1)
          ADD.D  F16,F14,F2
          S.D    -24(R1),F16
          DSUBUI R1,R1,#32     ;alter to 4*8
          BNEZ   R1,Loop
          NOP
Each L.D is followed by a 1-cycle stall and each ADD.D by a 2-cycle stall.

13 Unrolled Loop That Minimizes Stalls
Called code movement:
  Moving the store past DSUBUI
  Moving loads before stores
    Loop: L.D    F0,0(R1)
          L.D    F6,-8(R1)
          L.D    F10,-16(R1)
          L.D    F14,-24(R1)
          ADD.D  F4,F0,F2
          ADD.D  F8,F6,F2
          ADD.D  F12,F10,F2
          ADD.D  F16,F14,F2
          S.D    0(R1),F4
          S.D    -8(R1),F8
          S.D    -16(R1),F12
          DSUBUI R1,R1,#32
          BNEZ   R1,Loop
          S.D    8(R1),F16     ;delayed branch slot
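To see what unrolling and code movement buy, the three versions of the loop (slides 11 to 13) can be run through a tiny in-order issue model. This is a sketch, not the lecture's own tool: it assumes one instruction issues per cycle, a 1-cycle stall between a load and a dependent FP add, a 2-cycle stall between an FP add and a dependent store, and one branch delay slot, which are the assumptions stated on slide 11. The register names are placeholders rather than the F0/F4/... names on the slides.

```python
# Stall cycles required between a producer kind and a dependent consumer kind.
LATENCY = {("load", "fpalu"): 1, ("fpalu", "store"): 2}


def cycles(program):
    """program: list of (kind, dest, srcs). Returns the cycle in which
    the last instruction issues on a single-issue, in-order pipeline."""
    last_writer = {}  # register -> (issue cycle, kind of the writer)
    cycle = 0
    for kind, dest, srcs in program:
        earliest = cycle + 1
        for reg in srcs:
            if reg in last_writer:
                w_cycle, w_kind = last_writer[reg]
                earliest = max(earliest, w_cycle + 1 + LATENCY.get((w_kind, kind), 0))
        cycle = earliest
        if dest is not None:
            last_writer[dest] = (cycle, kind)
    return cycle


def load(i):  return ("load", f"L{i}", ["R1"])
def add(i):   return ("fpalu", f"A{i}", [f"L{i}", "F2"])
def store(i): return ("store", None, ["R1", f"A{i}"])

DSUBUI = ("int", "R1", ["R1"])
BNEZ = ("branch", None, ["R1"])
NOP = ("nop", None, [])

rolled = [load(0), add(0), store(0), DSUBUI, BNEZ, NOP]
unrolled = [op for i in range(4) for op in (load(i), add(i), store(i))] + [DSUBUI, BNEZ, NOP]
scheduled = ([load(i) for i in range(4)] + [add(i) for i in range(4)]
             + [store(i) for i in range(3)] + [DSUBUI, BNEZ, store(3)])

if __name__ == "__main__":
    for name, prog, iters in (("rolled", rolled, 1),
                              ("unrolled x4", unrolled, 4),
                              ("unrolled + scheduled", scheduled, 4)):
        total = cycles(prog)
        print(f"{name}: {total} cycles, {total / iters:.2f} per iteration")
```

Under these assumptions unrolling alone only removes loop overhead; most of the gain comes from the code movement on slide 13, which fills the stall cycles with independent work.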

14 Register Renaming
Original:
    Loop: L.D    F0,0(R1)
          ADD.D  F4,F0,F2
          S.D    0(R1),F4
          L.D    F0,-8(R1)
          ADD.D  F4,F0,F2
          S.D    -8(R1),F4
          L.D    F0,-16(R1)
          ADD.D  F4,F0,F2
          S.D    -16(R1),F4
          L.D    F0,-24(R1)
          ADD.D  F4,F0,F2
          S.D    -24(R1),F4
          DSUBUI R1,R1,#32
          BNEZ   R1,Loop
          NOP
After register renaming:
    Loop: L.D    F0,0(R1)
          ADD.D  F4,F0,F2
          S.D    0(R1),F4
          L.D    F6,-8(R1)
          ADD.D  F8,F6,F2
          S.D    -8(R1),F8
          L.D    F10,-16(R1)
          ADD.D  F12,F10,F2
          S.D    -16(R1),F12
          L.D    F14,-24(R1)
          ADD.D  F16,F14,F2
          S.D    -24(R1),F16
          DSUBUI R1,R1,#32
          BNEZ   R1,Loop
          NOP

15 VLIW: Very Long Instruction Word
Static superscalar: hardware detects hazards, the compiler determines scheduling.
VLIW: the compiler takes both jobs.
  Each “instruction” explicitly encodes multiple operations.
  There is no or only partial hardware hazard detection.
    No dependence-check logic for instructions issued in the same cycle.
  The wide instruction format allows theoretically high ILP.
Trade instruction space for simple decoding.
  The long instruction word has room for many operations.
  But it has to be filled with NOOPs if not enough operations are found.

16 VLIW Example: Loop Unrolling
Clock  Memory reference 1   Memory reference 2   FP operation 1     FP operation 2     Int. op/branch
1      L.D F0,0(R1)         L.D F6,-8(R1)        -                  -                  -
2      L.D F10,-16(R1)      L.D F14,-24(R1)      -                  -                  -
3      L.D F18,-32(R1)      L.D F22,-40(R1)      ADD.D F4,F0,F2     ADD.D F8,F6,F2     -
4      L.D F26,-48(R1)      -                    ADD.D F12,F10,F2   ADD.D F16,F14,F2   -
5      -                    -                    ADD.D F20,F18,F2   ADD.D F24,F22,F2   -
6      S.D 0(R1),F4         S.D -8(R1),F8        ADD.D F28,F26,F2   -                  -
7      S.D -16(R1),F12      S.D -24(R1),F16      -                  -                  -
8      S.D -32(R1),F20      S.D -40(R1),F24      -                  -                  DSUBUI R1,R1,#56
9      S.D 8(R1),F28        -                    -                  -                  BNEZ R1,LOOP
Unrolled 7 times to avoid delays.
7 results in 9 clocks, or 1.3 clocks per iteration (1.8X).
Average: 2.5 ops per clock, 50% efficiency.
Note: VLIW needs more registers (15 in this example).
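The figures quoted on this slide follow directly from counting filled slots in the schedule. The sketch below redoes that arithmetic; it records only the opcode in each slot (operands do not matter for the count), and the function name is hypothetical.

```python
# One row per clock, one column per issue slot:
# memory ref 1, memory ref 2, FP op 1, FP op 2, integer op/branch.
# None marks an empty slot, which is encoded as a no-op.
SCHEDULE = [
    ["L.D", "L.D", None,    None,    None],
    ["L.D", "L.D", None,    None,    None],
    ["L.D", "L.D", "ADD.D", "ADD.D", None],
    ["L.D", None,  "ADD.D", "ADD.D", None],
    [None,  None,  "ADD.D", "ADD.D", None],
    ["S.D", "S.D", "ADD.D", None,    None],
    ["S.D", "S.D", None,    None,    None],
    ["S.D", "S.D", None,    None,    "DSUBUI"],
    ["S.D", None,  None,    None,    "BNEZ"],
]


def vliw_stats(schedule, iterations):
    """Slot utilization of a VLIW schedule covering `iterations` loop bodies."""
    clocks = len(schedule)
    slots = sum(len(row) for row in schedule)
    ops = sum(1 for row in schedule for op in row if op is not None)
    return {
        "clocks": clocks,
        "slots": slots,
        "ops": ops,
        "ops_per_clock": ops / clocks,
        "efficiency": ops / slots,
        "clocks_per_iteration": clocks / iterations,
    }


if __name__ == "__main__":
    for key, value in vliw_stats(SCHEDULE, iterations=7).items():
        print(f"{key}: {value:.2f}" if isinstance(value, float) else f"{key}: {value}")
```

The 45 total slots are also the encoded size of the loop in operations, compared with 5 instructions in the original rolled loop; the empty slots are all stored as no-ops.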

17 Scheduling Across Branches
Local scheduling, or basic block scheduling
  Typically works on a range of 5 to 20 instructions.
  Unrolling may increase the basic block size to facilitate scheduling.
  However, what happens if branches exist in the loop body?
Global scheduling: moving instructions across branches (i.e., across basic blocks)
  We cannot change the data flow, whichever way a branch goes.
  How to guarantee correctness?
  Increase the scheduling scope: trace scheduling, superblocks, predicated execution, etc.

18 Problems with First Generation VLIW
Increase in code size
  Wasted issue slots are translated into no-ops in the instruction encoding.
  => About 50% of the instructions are no-ops.
  => The increase is from 5 instructions to 45 instructions!
Operated in lock-step; no hazard detection HW
  Any function unit stalls => the entire processor stalls.
  The compiler can schedule around function unit stalls, but what about cache misses?
  Compilers are not allowed to speculate!
Binary code compatibility
  Re-compile if the number of FUs or any FU latency changes; not a major problem today.

19 EPIC/IA-64: Motivation in 1989
“First, it was quite evident from Moore's law that it would soon be possible to fit an entire, highly parallel, ILP processor on a chip. Second, we believed that the ever-increasing complexity of superscalar processors would have a negative impact upon their clock rate, eventually leading to a leveling off of the rate of increase in microprocessor performance.” Schlansker and Rau, Computer, Feb. 2000
Obvious today: think about the complexity of the P4, the 21264, and other superscalar processors. Processor complexity has been discussed in many papers since the mid-1990s.
Agarwal et al., "Clock rate versus IPC: The end of the road for conventional microarchitectures," ISCA 2000

20 EPIC, IA-64, and Itanium
EPIC: Explicitly Parallel Instruction Computing, an architecture framework proposed by HP.
IA-64: an architecture that HP and Intel developed under the EPIC framework.
Itanium: the first commercial processor that implements the IA-64 architecture; now Itanium 2.

21 EPIC Main Ideas
The compiler does the scheduling.
  The compiler is permitted to play the statistics (profiling).
Hardware supports speculation.
  Addressing the branch problem: predicated execution and many other techniques.
  Addressing the memory problem: cache specifiers, prefetching, speculation on memory aliases.

22 Example: IF-conversion
Use of predicated execution to perform if-conversion: it eliminates the branch and produces just one basic block containing operations guarded by the appropriate predicates. (Schlansker and Rau, 2000)
Example of Itanium code:
         cmp.eq p1, p2 = r1, r2;;
    (p1) sub r9 = r10, r11
    (p2) add r5 = r6, r7
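The effect of the predicated Itanium sequence can be mimicked in software. This is only an emulation sketch: the register file is a dictionary, the function names are hypothetical, and the guarded writes stand in for what the hardware does when it squashes an operation whose predicate is false.

```python
def branchy(regs):
    """Original control flow: one of two basic blocks runs."""
    out = dict(regs)
    if out["r1"] == out["r2"]:
        out["r9"] = out["r10"] - out["r11"]
    else:
        out["r5"] = out["r6"] + out["r7"]
    return out


def if_converted(regs):
    """Single basic block: both operations are issued, and each result
    is committed only when its guarding predicate is true."""
    out = dict(regs)
    p1 = out["r1"] == out["r2"]   # cmp.eq p1, p2 = r1, r2
    p2 = not p1                   # p2 is the complement of p1
    sub_result = out["r10"] - out["r11"]
    add_result = out["r6"] + out["r7"]
    out["r9"] = sub_result if p1 else out["r9"]   # (p1) sub r9 = r10, r11
    out["r5"] = add_result if p2 else out["r5"]   # (p2) add r5 = r6, r7
    return out


if __name__ == "__main__":
    regs = {"r1": 3, "r2": 3, "r5": 0, "r6": 10, "r7": 20,
            "r9": 0, "r10": 7, "r11": 2}
    print(if_converted(regs) == branchy(regs))
    regs["r2"] = 4
    print(if_converted(regs) == branchy(regs))
```

Both versions leave the registers in the same state for either comparison outcome, which is the correctness condition for if-conversion. The predicated form spends issue slots on an operation that is discarded, in exchange for having no branch to mispredict.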

23 Memory Issues
Cache specifiers: the compiler indicates the cache location in loads and stores (use analytical models or profiling to find the answers?).
The compiler may actively remove data from the cache, or put data with poor locality into a special cache, reducing cache pollution.
The compiler can speculate that a memory alias does not exist, so it can reorder loads and stores.
  Hardware detects any violations.
  The compiler then fixes up.
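The alias speculation described above (hoist a load above a store, let hardware detect a conflict, then repair) can be modeled in a few lines. The sketch is loosely inspired by the IA-64 advanced-load mechanism, but the class, its methods, and the dictionary-based memory are hypothetical simplifications, not the real hardware interface.

```python
class SpeculativeLoadTable:
    """Tracks loads that were moved above possibly aliasing stores."""

    def __init__(self):
        self.entries = {}  # destination register -> address that was loaded

    def advanced_load(self, reg, addr, memory):
        """Load early and remember the address for later checking."""
        self.entries[reg] = addr
        return memory[addr]

    def store(self, addr, value, memory):
        """Perform the store; any tracked load of the same address is invalidated."""
        memory[addr] = value
        for reg in [r for r, a in self.entries.items() if a == addr]:
            del self.entries[reg]

    def check(self, reg, addr, value, memory):
        """At the load's original position: keep the speculative value if the
        entry survived, otherwise run the fix-up, which redoes the load."""
        if reg in self.entries:
            return value
        return memory[addr]


if __name__ == "__main__":
    memory = {100: 1, 200: 2}
    table = SpeculativeLoadTable()
    early = table.advanced_load("r8", 100, memory)  # hoisted above the store
    table.store(100, 7, memory)                     # the store aliases the load
    print(table.check("r8", 100, early, memory))    # fix-up returns the stored value
```

When the store hits a different address, the entry survives and the check costs almost nothing, which is the common case the compiler is betting on.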

24 Itanium: First Implementation by Intel
Itanium™ is the name of the first implementation of IA-64 (2001).
Highly parallel and deeply pipelined hardware: 6-issue, 10-stage pipeline at 800 MHz on a 0.18 µ process.
128 64-bit integer registers
128 82-bit floating point registers
Hardware checks dependencies.
Predicated execution
128-bit bundle: 5-bit template + 3 41-bit instructions
  Two bundles can be issued together in Itanium.
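An IA-64 bundle is 128 bits: a 5-bit template in the low bits followed by three 41-bit instruction slots (5 + 3 x 41 = 128). The sketch below packs and unpacks that layout with Python integers. The function names are hypothetical, and the template and slot values in the example are arbitrary bit patterns, not real instruction encodings.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOTS_PER_BUNDLE = 3


def pack_bundle(template, slots):
    """Place the template in bits 0-4 and the slots in bits 5-45, 46-86, 87-127."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template does not fit in 5 bits")
    if len(slots) != SLOTS_PER_BUNDLE:
        raise ValueError("a bundle holds exactly three instruction slots")
    word = template
    for i, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("slot does not fit in 41 bits")
        word |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return word


def unpack_bundle(word):
    """Inverse of pack_bundle: returns (template, [slot0, slot1, slot2])."""
    template = word & ((1 << TEMPLATE_BITS) - 1)
    mask = (1 << SLOT_BITS) - 1
    slots = [(word >> (TEMPLATE_BITS + i * SLOT_BITS)) & mask
             for i in range(SLOTS_PER_BUNDLE)]
    return template, slots


if __name__ == "__main__":
    word = pack_bundle(0x10, [0x123, 0x456, 0x789])  # arbitrary example values
    print(f"{word:032x}")
    print(unpack_bundle(word))
```

The template tells the hardware which kind of execution unit each slot needs and where groups of independent instructions end, which is how the compiler conveys its schedule without per-operation dependence checks at decode.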

25 Itanium™ Processor Silicon (Copyright: Intel at Hotchips ’00)
[Figure: die photo. Core processor die, 25M transistors, with labeled blocks: FPU, IA-32 control, instruction fetch & decode, cache, TLB, integer units, IA-64 control, bus. 4 x 1 MB L3 cache, 4 x 75M transistors.]

26 Compare with P4
42M transistors (PIII: 26M)
217 mm^2 die (PIII: 106 mm^2)
L1 execution cache: buffers 12,000 micro-ops
8 KB data cache
256 KB L2 cache

27 Comments on Itanium
Remarkably, the Itanium has many of the features more commonly associated with dynamically scheduled pipelines.
Performance: 800 MHz Itanium vs. 1 GHz 21264 and 2 GHz P4
  SPEC Int: 85% of the 21264, 60% of the P4
  SPEC FP: 108% of the P4, 120% of the 21264
  Power consumption: 178% of the P4 (watts per FP op)
It is surprising that an approach whose goal is to rely on compiler technology and simpler HW seems to be at least as complex as dynamically scheduled processors!