Henk Corporaal TUEindhoven 2011

Advanced Computer Architecture 5MD00 / 5Z033
EPIC / Itanium architecture: best of both worlds
Henk Corporaal, www.ics.ele.tue.nl/~heco, TUEindhoven 2011

Avoiding superscalar complexity
- An alternative: EPIC (Explicitly Parallel Instruction Computing)
- EPIC: best of both worlds?
  - Superscalar: expensive, but binary compatible
  - VLIW: simple, but not binary compatible
- Or: use a VLIW with binary translation at run time
  - Transmeta Crusoe: a VLIW processor that runs x86 code
1/2/2019 ACA H.Corporaal

EPIC Architecture: IA-64 / Itanium
- EPIC: Explicitly Parallel Instruction Computing
- IA-64, now called the Itanium architecture
- Implementations:
  - Merced (2001), McKinley (2002)
  - Montecito (2006): 2 cores, 4-way multithreading, 2x12 MB L3, 596 mm2, 90 nm
  - Tukwila (2010): 4 cores, 65 nm, 699 mm2, 24 MB L3
  - Poulson (2012): 8 cores, 32 nm, 3 billion transistors, 48 MB L3 cache, 544 mm2, 4-way hyperthreading per core, 12-issue per core
  - Kittson (?? 2014)


Itanium: Register model
- 128 64-bit integer registers, with register stack and rotating register file support
- 128 82-bit floating-point registers, rotating
- 64 1-bit boolean (predicate) registers
- 8 64-bit branch target address registers
- System control registers

Itanium instruction format
- Instructions are grouped in 128-bit bundles
  - 3 * 41-bit instructions
  - 5 template bits, which indicate the instruction types and the stop location
- Each 41-bit instruction starts with a 4-bit opcode and ends with a 6-bit guard (boolean) register id
- Bundle layout (bits): template 5 | instruction 41 | instruction 41 | instruction 41
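The bundle arithmetic above (5 + 3 * 41 = 128 bits) can be checked with a small packing sketch. The helper names `pack_bundle`, `unpack_bundle` and `slot_fields` are made up for this illustration, and placing the template in the low-order bits is an assumption of the sketch, not something the slide states.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOT_MASK = (1 << SLOT_BITS) - 1


def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit slots into one 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three instructions")
    bundle = template  # assumption: template sits in the low-order bits
    for i, slot in enumerate(slots):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("each instruction must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Inverse of pack_bundle: return (template, [slot0, slot1, slot2])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [(bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK
             for i in range(3)]
    return template, slots


def slot_fields(slot):
    """Split a 41-bit instruction into (4-bit opcode, 6-bit guard register id)."""
    opcode = slot >> (SLOT_BITS - 4)  # the instruction starts with the opcode
    guard = slot & 0x3F               # and ends with the guard register id
    return opcode, guard
```

For example, an instruction built as `(8 << 37) | 5` decodes to opcode 8 guarded by predicate register 5.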


Predication
- Predicated execution of virtually all instructions: (p) add r1 = r2, r3
  - If p is true, a normal add operation; otherwise a NOP
- 64 1-bit predicate registers
- Advantages of predicated execution:
  - Removes branches
  - Converts control dependence into data dependence
  - Reduces misprediction penalties
  - Increases the size of basic blocks
  - Code from both the taken and the not-taken path can be scheduled in the same cycle
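As a rough illustration of how predication turns a branch into straight-line code, the sketch below interprets a tiny if-converted sequence. The tuple encoding and the `execute` function are invented for this example; only the idea that a false guard turns an instruction into a NOP comes from the slide. It also assumes that p0 always reads as true.

```python
def execute(program, regs):
    """Run a list of guarded instructions; a false guard makes the instruction a NOP."""
    preds = [False] * 64
    preds[0] = True  # assumption: p0 is hardwired to true
    for guard, op, *args in program:
        if not preds[guard]:
            continue  # squashed: behaves as a NOP
        if op == "cmp.lt":
            p_true, p_false, a, b = args
            outcome = regs[a] < regs[b]
            preds[p_true], preds[p_false] = outcome, not outcome
        elif op == "add":
            dest, src, imm = args
            regs[dest] = regs[src] + imm
        else:
            raise ValueError(f"unknown operation {op}")
    return regs


# if (r2 < r3) r1 = r2 + 1; else r1 = r3 + 2;   with no branch at all
IF_CONVERTED = [
    (0, "cmp.lt", 1, 2, "r2", "r3"),  # cmp.lt p1, p2 = r2, r3
    (1, "add", "r1", "r2", 1),        # (p1) add r1 = r2, 1
    (2, "add", "r1", "r3", 2),        # (p2) add r1 = r3, 2
]
```

Both add instructions are always fetched and can share a cycle; the predicates decide which one takes effect.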

Control Speculation
- Loads incur high latency
- Loads need to be scheduled as early as possible
- Two barriers: branches and stores
- Control speculation: move loads above branches

Control speculation: move loads above branches
- Problem: loads can cause exceptions
- Solution: separate load behavior from exception behavior
  - A speculative load (ld.s) initiates the load and detects exceptions
  - On an exception, hardware propagates an exception token (stored with the destination register) from ld.s to chk.s
  - A speculative check (chk.s) delivers the exception detected by ld.s

Control Speculation
- Speculating the uses of a control-speculative load further increases ILP
- Dependent instructions following the load can also be speculated above branches
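The ld.s / chk.s split can be sketched as a toy model. Here a Python dict plays the role of memory, a missing key plays the role of a faulting address, and `NAT` is a stand-in for the exception token. All function names are illustrative, and `chk_s` simply raises the deferred exception, which is a simplification of what real hardware does.

```python
class MemoryFault(Exception):
    """Stand-in for the exception a faulting load would raise."""


NAT = object()  # hypothetical marker for the exception token


def ld(memory, addr):
    """Normal load: faults immediately on a bad address."""
    if addr not in memory:
        raise MemoryFault(addr)
    return memory[addr]


def ld_s(memory, addr):
    """Speculative load: records the fault as a token instead of raising."""
    return memory[addr] if addr in memory else NAT


def add(a, b):
    """Dependent use: the token propagates through the computation."""
    return NAT if (a is NAT or b is NAT) else a + b


def chk_s(value):
    """Speculative check: delivers the deferred exception, if there is one."""
    if value is NAT:
        raise MemoryFault("deferred exception delivered by chk.s")
    return value


def original(memory, ptr):
    if ptr is not None:
        return ld(memory, ptr) + 1
    return 0


def speculated(memory, ptr):
    value = ld_s(memory, ptr)  # load hoisted above the branch
    total = add(value, 1)      # its use is hoisted as well
    if ptr is not None:
        return chk_s(total)    # exception surfaces only on the path that needs the value
    return 0
```

With a null pointer, `speculated` performs the hoisted load harmlessly and returns 0, just as `original` does.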

Data Speculation
- Loads and previous stores can conflict
  - When a load and a store overlap (access the same memory location), the load must wait for the previous store because of the RAW dependence
- IA-64 enables data speculation with ld.a and ld.c/chk.a, supported by the ALAT (Advanced Load Address Table):
  - ld.a performs a normal load and inserts the address into the ALAT
  - Any intervening store removes overlapping entries from the ALAT
  - The advanced load check (ld.c) looks in the ALAT for a violation and reissues the load if necessary
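A minimal sketch of the ALAT protocol described above, under simplifying assumptions: entries are keyed by destination register, addresses only overlap when they are exactly equal, and the table has no capacity limit. The class and method names are invented for this illustration.

```python
class ALAT:
    """Toy Advanced Load Address Table: destination register -> loaded address."""

    def __init__(self):
        self.entries = {}

    def ld_a(self, regs, memory, dest, addr):
        """Advanced load: perform the load and remember its address."""
        regs[dest] = memory[addr]
        self.entries[dest] = addr

    def store(self, memory, addr, value):
        """Store: write memory and drop every entry that overlaps the address."""
        memory[addr] = value
        for reg in [r for r, a in self.entries.items() if a == addr]:
            del self.entries[reg]

    def ld_c(self, regs, memory, dest, addr):
        """Check load: reissue the load only if the entry was invalidated.

        Returns True when the load had to be reissued.
        """
        if dest in self.entries:
            return False  # speculation succeeded, the register is valid
        regs[dest] = memory[addr]
        return True
```

When the intervening store hits a different address, `ld_c` costs nothing; when it hits the same address, the load is simply repeated and the register ends up with the stored value.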

Data Speculation
- Move loads above potentially overlapping stores

Data Speculation
- Uses of speculative data can be speculated as well
- Control and data speculation can be combined
  - Schedule loads across branches and across stores at the same time
  - Speculative advanced load: ld.sa combines the semantics of ld.a and ld.s

Register Stack
- Procedure call overhead
  - Spill registers to memory on call
  - Restore them on procedure return
- Register stack
  - The register stack is used to save/restore procedure contexts across calls
  - A stack area in memory is used to save/restore procedure context
- Explicit allocation of stack frames
  - Effective use of 96 registers: allocate only what is needed
- Overlapping stack frames avoid parameter copying
- Mechanism implemented by renaming register addresses
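The renaming idea can be sketched as follows: each frame is a window onto one physical register array, and a call moves the window base so that the caller's output registers become the callee's first registers. The class below is a hypothetical model for illustration only; it assumes the stacked registers are named from r32 upward and it raises an error where a real machine would spill to memory.

```python
class RegisterStack:
    """Toy register stack: frames are renamed windows onto one physical array."""

    FIRST = 32  # assumption: stacked registers are named r32 and up

    def __init__(self, n_physical=96):
        self.phys = [0] * n_physical
        self.base = 0     # physical index currently named r32
        self.n_local = 0  # inputs + locals of the current frame
        self.n_out = 0    # output registers of the current frame
        self.saved = []   # frame markers of the callers

    def alloc(self, n_local, n_out):
        """Explicitly size the current frame: allocate only what is needed."""
        if self.base + n_local + n_out > len(self.phys):
            raise OverflowError("out of physical registers (spilling not modelled)")
        self.n_local, self.n_out = n_local, n_out

    def _index(self, reg):
        offset = reg - self.FIRST
        if not 0 <= offset < self.n_local + self.n_out:
            raise IndexError(f"r{reg} is outside the current frame")
        return self.base + offset

    def read(self, reg):
        return self.phys[self._index(reg)]

    def write(self, reg, value):
        self.phys[self._index(reg)] = value

    def call(self):
        """The caller's outputs become the callee's first registers; nothing is copied."""
        self.saved.append((self.base, self.n_local, self.n_out))
        self.base += self.n_local
        self.n_local, self.n_out = self.n_out, 0

    def ret(self):
        """Restore the caller's window."""
        self.base, self.n_local, self.n_out = self.saved.pop()
```

A caller that writes an argument to its output register r34 finds the callee reading it as r32, and a value the callee leaves there is visible to the caller again as r34 after the return.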


Register Stack Engine (RSE)
- Automatically saves/restores stack registers without software intervention
  - Avoids explicit spill/fill (eliminates stack management overhead)
  - Provides the illusion of infinite physical registers
- The RSE uses otherwise unused memory bandwidth (cycle stealing) to perform register spill and fill operations in the background
- Overflow: alloc needs more registers than are available
- Underflow: a return needs to restore a frame saved in memory

Software Pipelining Support
- High-performance loops without code size overhead
  - No prologue and epilogue
- Rotating registers
  - Provide automatic renaming
- Rotating predicates (stage predicates)
  - Unify prologue, kernel, and epilogue
- Loop control registers (LC, EC)
- Loop branches
  - Counted loop (br.ctop)
  - While loop (br.wtop)
- Especially valuable for integer loops with small trip counts

Software Pipelining Example

Schedule over time (one row per cycle):

    Prolog   ld
             ld
             add ld
    Kernel   st add ld
             st add ld
    Epilog   st add
             st add
             st

Original loop:

    L1: ld4 r4 = [r5], 4     // Cycle 0
        add r7 = r4, r9      // Cycle 2
        st4 [r6] = r7, 4     // Cycle 3
        br.cloop L1 ;;

Software-pipelined loop:

    L1: (p16) ld4 r32 = [r5], 4    // Cycle 0
        (p18) add r35 = r34, r9    // Cycle 0
        (p19) st4 [r6] = r36, 4    // Cycle 0
        br.ctop L1                 // Cycle 0

What happens during runtime? The names of the same physical registers and predicates shift by one in every iteration:

    Iteration 1   r32 r33 r34 r35 ...   p16=1 p17=0 p18=0 p19=0 ..
    Iteration 2   r33 r34 r35 r36 ...   p17=1 p18=0 p19=0 .. p16=1
    Iteration 3   r34 r35 r36 r37 ...   p18=1 p19=0 .. p16=1 p17=1
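To see how rotation lets the single-bundle kernel act as prolog, kernel and epilog, the sketch below simulates the pipelined loop. It is a simplified model with several assumptions: rotation is modelled by physically shifting lists (hardware renames instead), the loop is set up with LC = N - 1, EC = 4 and only p16 set, memory is replaced by Python lists, and the function name is made up.

```python
def run_pipelined_loop(src, r9, stages=4):
    """Simulate the predicated kernel; return (stored values, kernel passes)."""
    if not src:
        return [], 0
    gr = [0] * 96      # gr[0] is r32, gr[1] is r33, ...
    pr = [False] * 48  # pr[0] is p16, pr[1] is p17, ...
    pr[0] = True       # assumed setup: only the first stage is enabled
    lc, ec = len(src) - 1, stages
    load_index = 0
    dst, passes = [], 0
    while True:
        passes += 1
        if pr[0]:                # (p16) ld4 r32 = [r5], 4
            gr[0] = src[load_index]
            load_index += 1
        if pr[2]:                # (p18) add r35 = r34, r9
            gr[3] = gr[2] + r9
        if pr[3]:                # (p19) st4 [r6] = r36, 4
            dst.append(gr[4])
        # br.ctop: keep starting iterations while LC > 0, then drain with EC
        if lc > 0:
            lc -= 1
            new_p16 = True
        elif ec > 1:
            ec -= 1
            new_p16 = False
        else:
            break
        # Rotate: the value written as r32 is read as r33 in the next pass,
        # and the stage predicates move along in the same way.
        gr = [gr[-1]] + gr[:-1]
        pr = [new_p16] + pr[:-1]
    return dst, passes
```

For five elements the kernel runs eight times (5 + 4 - 1), which matches the eight rows of the schedule: three passes fill the pipeline, two run at full occupancy, and three drain it.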

IA-64 / Itanium architecture: a VLIW?
- Yes, but:
  - Instructions contain only one operation; the compiler can indicate that successive instructions can be executed in parallel
  - Hardware does the operation-to-FU binding
  - Pipeline latencies are not visible in the ISA
- These measures make the ISA independent of the number of FUs and of the pipeline latencies
  - The ISA therefore supports multiple implementations
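A very rough way to picture why the same code suits several implementations: the compiler only marks groups of independent instructions, and each machine issues a group as fast as its width allows. The sketch below assumes a `;;` marker ends a group and that every instruction takes one issue slot; it ignores FU types, latencies and bundle templates. Both function names are illustrative.

```python
import math


def split_groups(stream):
    """Split an instruction stream into groups at ';;' stop markers."""
    groups, current = [], []
    for item in stream:
        if item == ";;":
            if current:
                groups.append(current)
                current = []
        else:
            current.append(item)
    if current:
        groups.append(current)
    return groups


def issue_cycles(groups, issue_width):
    """Cycles needed when each group is issued issue_width instructions at a time."""
    return sum(math.ceil(len(group) / issue_width) for group in groups)
```

The same stream with one group of six instructions and one group of two needs four cycles on a 2-wide machine and two cycles on a 6-wide machine, without recompilation.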

HW vs SW scheduling + binding?
Architecture options:

                                  Scheduling operations
    Binding operations to FUs     HW                   SW
    HW                            O-o-O superscalar    Itanium
    SW                            TRIPS                VLIW

Montecito 2006: dual 11-issue cores

Tukwila: 4-core Itanium, 2010